WO2018108013A1 - Medium displaying method and terminal - Google Patents

Medium displaying method and terminal

Info

Publication number
WO2018108013A1
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
target
phonetic
image
correspondence table
Application number
PCT/CN2017/114843
Other languages
French (fr)
Chinese (zh)
Inventor
张长帅 (ZHANG Changshuai)
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2018108013A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72484 User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M 1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M 1/72439 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for image or video messaging
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 1/00 Substation equipment, e.g. for use by subscribers
    • H04M 1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/725 Cordless telephones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/141 Systems for two-way working between two video terminals, e.g. videophone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M 2250/00 Details of telephonic subscriber devices
    • H04M 2250/74 Details of telephonic subscriber devices with voice recognition means

Definitions

  • This document relates to, but is not limited to, the field of communication technologies, and in particular, to a media display method and terminal.
  • a mobile phone can access the network through a SIM card or Wi-Fi to make a video call.
  • a media display method and terminal are provided in an embodiment of the present invention.
  • an embodiment of the present invention provides a media display method, including:
  • the at least one frame of the target media display image is played and displayed through the display interface.
  • acquiring, according to the target call voice, the at least one frame of the target media display image that matches the target call voice includes:
  • determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice includes:
  • determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
  • determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes;
  • determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
  • the correspondence table between text and phonetic transcriptions includes a correspondence table between Chinese characters and pinyin; and matching, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals of pinyin and pronunciation mouth shapes; and matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes includes:
  • obtaining the corresponding at least one pronunciation mouth shape from the correspondences between initials and finals and pronunciation mouth shapes.
  • the correspondence table between text and phonetic transcriptions includes a correspondence table between English words and English phonetic symbols; and matching, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants of the English phonetic symbols and pronunciation mouth shapes; and matching, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:
  • obtaining the corresponding at least one pronunciation mouth shape by matching.
  • the determining a target emoticon packet corresponding to the target call voice includes:
  • the collecting the target call voice includes:
  • before the step of collecting the target call voice, the method further includes:
  • the personal image includes a facial image of a person, and the media material images include pronunciation mouth shape images corresponding to the initials and finals of pinyin, or pronunciation mouth shape images corresponding to the vowels and consonants of the English phonetic symbols.
  • integrating the personal image of the call contact with each of the media material images to generate at least one media display image, and obtaining an emoticon package including the at least one media display image, includes:
  • playing and displaying, through the display interface, the at least one frame of the target media display image includes:
  • the embodiment of the present invention further provides a media display terminal, including:
  • the acquisition module is configured to collect the target call voice
  • a first acquiring module configured to acquire, according to the target call voice, at least one frame target media display image that matches the target call voice
  • the display module is configured to perform play display on the at least one frame of the target media display image through the display interface.
  • the first acquiring module includes:
  • a first determining submodule configured to determine a target emoticon packet corresponding to the target call voice
  • a second determining submodule configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice
  • an obtaining submodule configured to obtain, from the target emoticon package, at least one frame of the target media display image including the target pronunciation mouth shape.
  • the second determining submodule includes:
  • a first determining unit configured to determine, according to the voiceprint feature of the target call voice, text content corresponding to the target call voice
  • the first matching unit is configured to match the corresponding phonetic combination from a correspondence table between text and phonetic transcriptions according to the text content;
  • the second matching unit is configured to match at least one corresponding pronunciation mouth shape from a correspondence table between phonetic transcriptions and pronunciation mouth shapes according to the phonetic combination;
  • the second determining unit is configured to determine that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
  • the correspondence table between the text and the phonetic includes a correspondence table between Chinese characters and pinyin;
  • the first matching unit includes:
  • the first matching subunit is configured to match, according to the text content, a pinyin combination corresponding to the text content from a correspondence table between Chinese characters and pinyin;
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit includes:
  • the second matching subunit is configured to match at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes according to the pinyin combination.
  • the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals of pinyin and pronunciation mouth shapes; the second matching subunit is configured to:
  • obtain the corresponding at least one pronunciation mouth shape from the correspondences between initials and finals and pronunciation mouth shapes.
  • the correspondence table between the text and the phonetic includes a correspondence table between the English and the English phonetic symbols;
  • the first matching unit includes:
  • the third matching subunit is configured to match the phonetic symbol combination corresponding to the text content from the correspondence table between English words and English phonetic symbols according to the text content;
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit includes:
  • the fourth matching subunit is configured to match at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes according to the phonetic symbol combination.
  • the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants of the English phonetic symbols and pronunciation mouth shapes; the fourth matching subunit is configured to:
  • match the corresponding at least one pronunciation mouth shape.
  • the first determining submodule includes:
  • a third determining unit configured to determine a target contact corresponding to the target call voice
  • the retrieval unit is configured to retrieve a target emoticon package pre-associated with the target contact.
  • the collecting module (301) includes:
  • the monitoring submodule is set to monitor the voice call process
  • the third determining submodule is configured to determine that the received counterpart voice is the target call voice.
  • the media display terminal further includes:
  • a second acquiring module configured to acquire a material resource package and a personal image of the call contact, wherein the material resource package includes at least one media material image
  • a generating module configured to integrate the personal image of the call contact with each of the media material images to generate at least one media display image, and obtain an emoticon package including the at least one media display image.
  • the personal image includes a facial image of the individual, and the media material images include pronunciation mouth shape images corresponding to the initials and finals of pinyin, or pronunciation mouth shape images corresponding to the vowels and consonants of the English phonetic symbols.
  • the generating module includes:
  • a replacement submodule configured to fill the mouth region of the facial image with the pronunciation mouth shape images in the media material images, replacing the original mouth
  • a generating submodule configured to generate a media display image corresponding to each pronunciation mouth shape image in the media material images, and obtain an emoticon package including the media display images corresponding to each pronunciation mouth shape image.
  • the display module includes:
  • the display submodule is configured to use an obtained display image as a background image on the display interface, and play and display the at least one frame of the target media display image.
  • virtual video telephony can break through the limitations of the network environment. The process does not depend on the network, saves data traffic or even removes traffic restrictions entirely, and accompanies the call with a virtualized video picture, making the call more lively and vivid, strengthening the communication effect, adding fun to communication, and improving the user experience.
  • FIG. 1 is a schematic flow chart of a media display method in an embodiment of the present invention
  • FIG. 2 is a schematic flow chart of another media display method in an embodiment of the present invention.
  • FIG. 3 is a block diagram showing the structure of a media display terminal in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a mouth-shape template in an embodiment of the present invention.
  • a mobile phone can access the network through a SIM card or Wi-Fi to make a video call
  • however, a video call usually relies on wireless networking, wireless network coverage is not in practice available anytime and anywhere, and making a video call over the SIM card connection is expensive.
  • this method therefore depends on, and is constrained by, the network environment, so video conversation cannot become the normal mode of communication; and when people talk, voice-only communication is neither flexible nor vivid, so the user experience is lacking.
  • a media display method is disclosed in the embodiment of the present invention.
  • Step 101 Collect a target call voice.
  • this step may occur during an ordinary telephone call or while voice is played back in instant messaging software such as WeChat. The target call voice may be the voice of the user in the call or the voice of the contact talking with the user.
  • the target call voice becomes the object of the subsequent processing.
  • the step of collecting the target call voice may include: monitoring a voice call process, and determining that the received counterpart voice is the target call voice. That is, the call voice of the other party is collected and passed to the next stage of processing to realize a fun call.
  • Step 102 Acquire, according to the target call voice, at least one frame of the target media display image that matches the target call voice.
  • the at least one frame of the target media display image may be selected from a set of image packets.
  • the at least one frame of the target media display image may match the content of the target call voice, or match the tone, volume, or mouth shape required to produce the target call voice. The target media display image may show an expression, a body motion, a symbol, or the location and context of the voice content.
  • Step 103 Play and display the at least one frame of the target media display image through the display interface.
  • body motions may be switched along with the voice.
  • this process matches at least one frame of the target media display image and plays and displays it to form a video-style playback effect, so that recognized voice is converted into a video display. This can be done by a local software application on the terminal device, which breaks through the limitations of the network environment and provides virtual video telephony. The process does not depend on the network, saves data traffic or even removes traffic restrictions, accompanies the call with a virtualized video picture, makes the call more lively and vivid, strengthens the communication effect, adds fun to communication, and improves the user experience.
  • a media display method is also disclosed in the embodiment of the present invention.
  • Step 201 Collect a target call voice.
  • the step of collecting the target call voice may include: monitoring a voice call process, and determining that the received counterpart voice is the target call voice. That is, the call voice of the other party is collected and passed to the next stage of processing to realize a fun call.
  • Step 202 Determine a target emoticon packet corresponding to the target call voice.
  • the target emoticon package may be a fixed emoticon package that matches different target call voices and is determined and read directly from the storage device; or it may be an emoticon package that changes with specific elements of the target call voice, in which case it needs to be matched and determined according to the target call voice.
  • the step of determining a target emoticon package corresponding to the target call voice includes:
  • the emoticon package may be resource content associated with the target contact. For example, when a known contact calls the user's mobile phone, the phone can determine from the address book that the target call voice is sent by that contact and select the emoticon package corresponding to the caller, which may be based on a photo of the target contact or a specific picture. Different settings can be made for specific contacts, making the displayed result more targeted and more interesting.
  • Step 203 Determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice.
  • different call voices correspond to different pronunciation mouth shapes, and different voices have different voiceprint feature information.
  • the required pronunciation mouth shapes can therefore be determined according to the voiceprint features in the collected target call voice.
  • the step of determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice includes:
  • a phonetic transcription expresses the pronunciation of text. Different languages correspond to different scripts, and different scripts correspond to different phonetic systems: when the text is Chinese, the phonetic transcription is pinyin; when the text is English, it is the English phonetic symbols. To determine the corresponding pronunciation mouth shapes from the voiceprint feature of the target call voice, the voice is first converted into text, the text is matched to the corresponding phonetic combination, and the phonetic combination is matched to the corresponding target pronunciation mouth shapes.
  • the process is implemented as follows: receive the voice, quantize the voice data, and call an open-source interface to convert the voice into text.
  • the principle is that different voices have different voiceprint feature information, so comparisons between voiceprint feature information and text can be recorded in advance to generate a database correspondence. After new voice is captured, it is compared with the pre-recorded database to find the corresponding text; converting text to a phonetic transcription works the same way.
  • the contrasts between different phonetic transcriptions and the corresponding texts are pre-recorded and written into an array to generate the database correspondence. After the voice is converted into text, the phonetic transcription is found in the database, and the target pronunciation mouth shapes corresponding to the phonetic transcription are obtained from the correspondence table between phonetic transcriptions and pronunciation mouth shapes preset in the database. The phonetic transcription is split and fitted to emoticon images, which are switched rapidly in the display to generate a virtual video effect.
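As an illustration of the table-driven lookup chain described above, the following Python sketch chains the pre-recorded correspondences with plain dictionaries. All table entries, image names, and the stubbed speech-recognition step are hypothetical stand-ins; a real implementation would pre-record full voiceprint/text and phonetic/mouth-shape tables and call an actual open-source recognition interface.

```python
# Minimal sketch of the pre-recorded database correspondences: the tables
# below are tiny hypothetical stand-ins, not the patent's actual data.

TEXT_TO_PHONETIC = {"你": "ni", "好": "hao"}       # text -> phonetic transcription
PHONETIC_TO_MOUTH = {"ni": "mouth_ni.png",         # phonetic -> mouth-shape image
                     "hao": "mouth_hao.png"}

def recognize_text(voice_samples):
    """Stand-in for the open-source speech-to-text interface."""
    return "你好"                                   # assume the voice said "ni hao"

def voice_to_mouth_images(voice_samples):
    text = recognize_text(voice_samples)            # voice -> text
    phonetics = [TEXT_TO_PHONETIC[ch] for ch in text]     # text -> phonetic
    return [PHONETIC_TO_MOUTH[p] for p in phonetics]      # phonetic -> mouth shapes

print(voice_to_mouth_images([0.1, 0.2]))
# ['mouth_ni.png', 'mouth_hao.png']
```

Rapidly displaying the returned images in order would produce the virtual video effect described.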
  • the correspondence table between text and phonetic transcriptions includes a correspondence table between Chinese characters and pinyin;
  • matching the phonetic combination corresponding to the text content then includes:
  • matching a pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes: matching, according to the pinyin combination, the corresponding at least one pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • the process of converting Chinese characters into pinyin is also the same.
  • the relationships between pinyin and Chinese characters are pre-recorded by reading the pinyin table in the standard GBK character set database and writing it into an array to generate the database correspondence. After the voice is converted into Chinese characters, the characters are compared with the database,
  • the target pronunciation mouth shapes corresponding to the pinyin are obtained, the pinyin is split and fitted to expression images, and the display switches rapidly to generate a virtual video effect.
  • the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals of pinyin and pronunciation mouth shapes; matching from the correspondence table between pinyin and pronunciation mouth shapes according to the pinyin combination then includes:
  • obtaining the corresponding at least one pronunciation mouth shape by matching.
  • the pinyin is located and split in the array structure to separate the initial and the final. Since some Chinese pronunciations correspond to a final alone, the resulting pinyin combination may include an initial plus a final or only a final. The emoticons are fitted according to the mouth shapes corresponding to the initials and finals and switched in the display window; the rapid switching of expressions generates a virtual video presentation.
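The locate-and-split step above can be sketched as follows. This is an illustrative sketch, not the patented implementation: the initials list approximates the standard pinyin initial set (treating y/w as initials is a simplifying assumption), and the fallback covers final-only syllables such as "ai".

```python
# Split one pinyin syllable into (initial, final); some syllables have no initial.

INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_pinyin(syllable):
    """Return (initial, final); the initial is '' for final-only syllables."""
    for ini in INITIALS:                # two-letter initials are tried first
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable                 # e.g. "ai", "e", "an"

print(split_pinyin("zhang"))   # ('zh', 'ang')
print(split_pinyin("ai"))      # ('', 'ai')
```

Each returned part would then index the corresponding mouth-shape image in the resource library.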
  • the correspondence table between pinyin and pronunciation mouth shapes may also be generated by import or download.
  • the storage device is provided with a mouth-shape resource library for the pinyin letters.
  • each mouth-shape image in the library corresponds to one pronunciation example of the pinyin letters, and the mouth shapes correspond one-to-one with the initials and finals of the pinyin table, so that every pronunciation can find its corresponding initial and final images in the mouth-shape library, making the pronunciation mouth shape truly consistent with the appearance of the speaker's actual mouth.
  • the correspondence table between text and phonetic transcriptions includes a correspondence table between English words and English phonetic symbols; and matching the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • matching a phonetic symbol combination corresponding to the text content from the correspondence table between English words and English phonetic symbols.
  • the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes: matching, according to the phonetic symbol combination, at least one pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants of the English phonetic symbols and pronunciation mouth shapes; and matching, according to the phonetic symbol combination, at least one pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:
  • this process matches the text content in the correspondence table between English words and English phonetic symbols.
  • the English phonetic symbols are located and split in the array structure into vowels and consonants. Since some English pronunciations correspond to a vowel alone, the resulting phonetic symbol combination may contain a combination of vowels and consonants or only a vowel. The emoticon is fitted according to the mouth shapes corresponding to the vowels and consonants and switched in the display window; the rapid switching of expressions generates a virtual video presentation.
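The English counterpart of the splitting step can be sketched the same way. The symbol inventory below is a small hypothetical subset of the full English phonetic symbol table, and the greedy longest-match strategy is one plausible way to separate multi-character vowel symbols; neither is prescribed by the source.

```python
# Split an English phonetic transcription into known vowel/consonant units.

VOWELS = {"i:", "ɪ", "e", "æ", "ɑ:", "ə", "ʌ", "u:", "əʊ"}
CONSONANTS = {"h", "l", "s", "t", "k", "m", "n", "d"}

def split_symbols(transcription):
    """Greedily split a transcription such as 'həˈləʊ' into known units."""
    units, i = [], 0
    text = transcription.replace("ˈ", "")        # drop stress marks
    while i < len(text):
        for length in (2, 1):                    # longest match first ("əʊ" before "ə")
            unit = text[i:i + length]
            if unit in VOWELS or unit in CONSONANTS:
                units.append(unit)
                i += length
                break
        else:
            i += 1                               # skip unknown symbols
    return units

print(split_symbols("həˈləʊ"))   # ['h', 'ə', 'l', 'əʊ']
```

Each unit would then select the corresponding vowel or consonant mouth-shape image from the library.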
  • the correspondence table between English phonetic symbols and pronunciation mouth shapes may also be generated by import or download.
  • the solution is described based on a mobile phone, taking virtual video as an example.
  • the storage device is provided with a mouth-shape resource library for the English phonetic symbols.
  • each mouth-shape image in the library corresponds to one pronunciation example of an English phonetic symbol, and the mouth shapes correspond one-to-one with the vowels and consonants of the English phonetic symbol table, so that every pronunciation can find its corresponding vowel and consonant pronunciation images in the mouth-shape library and combine them, making the pronunciation mouth shape truly consistent with the actual mouth of the caller.
  • Step 204 Match, from the target emoticon package, at least one frame of the target media display image including the target pronunciation mouth shape.
  • the target emoticon package is an emoticon package containing different mouth shapes, and the media display images it includes are display images with differently shaped mouths. Optionally, the target media display image may also contain other changing elements that match the mouth shape, such as eyebrow changes, expressions, and face shapes.
  • Step 205 Perform play display on at least one frame of the target media display image through the display interface.
  • the user terminal, for example a mobile phone, adapts the associated expressions according to voice recognition and synthesizes the corresponding image resources, so that the picture plays continuously along with the voice and the expressions switch and update continuously.
  • a virtual video scene is thus generated in a network-free environment, realizing a virtually displayed video dialogue and making dialogue and communication more effective and interesting.
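The image-resource synthesis mentioned above — pasting a pronunciation mouth-shape image into the mouth region of the contact's facial image — can be sketched with a toy pixel model. Images are modeled as nested lists of grayscale values and the region coordinates are hypothetical; a real implementation would operate on bitmap resources and locate the mouth region by face detection.

```python
# Toy sketch of mouth-region replacement for generating a media display image.

def paste_mouth(face, mouth, top, left):
    """Return a copy of `face` with `mouth` pasted at (top, left)."""
    out = [row[:] for row in face]                  # copy the facial image
    for r, row in enumerate(mouth):
        for c, pixel in enumerate(row):
            out[top + r][left + c] = pixel          # fill the mouth region
    return out

face = [[0] * 4 for _ in range(4)]                  # 4x4 blank "face"
mouth = [[9, 9]]                                    # 1x2 "mouth shape"
frame = paste_mouth(face, mouth, top=2, left=1)
print(frame[2])   # [0, 9, 9, 0]
```

Generating one such frame per mouth-shape image yields the emoticon package of media display images.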
  • Step 1 Read the pinyin content of the standard Chinese character encoding character set (GBK) database in advance, and write the pinyin data into an array.
  • each element of the array is a structure divided into four parts: the pinyin, the initial after splitting, the final, and the Chinese character corresponding to the pinyin.
  • Step 2 Convert the voice into text through the open-source interface, search the array with the text as the key to find the corresponding pinyin, and then split the pinyin into the corresponding initial and final.
  • Step 3 Use the initials and finals to look up the associated expressions, create a visual dialog on the desktop as a user interface (UI) window, and present the expressions in the UI window.
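The three steps above can be sketched end to end. Each array element mirrors the four-part structure (pinyin, split initial, split final, corresponding Chinese character); the two table entries are illustrative, standing in for the full GBK pinyin table.

```python
# End-to-end sketch of the Step 1-3 pipeline with a two-entry table.

from collections import namedtuple

Entry = namedtuple("Entry", "pinyin initial final hanzi")

# Step 1: the pre-built array (in practice read from the GBK pinyin table).
TABLE = [
    Entry("ni", "n", "i", "你"),
    Entry("hao", "h", "ao", "好"),
]

def lookup(hanzi):
    """Step 2: search the array with the character as the key."""
    for entry in TABLE:
        if entry.hanzi == hanzi:
            return entry
    return None

def expressions_for(text):
    """Step 3: collect the initial/final units used to pick expressions."""
    units = []
    for ch in text:
        entry = lookup(ch)
        if entry:
            units.extend([entry.initial, entry.final])
    return units

print(expressions_for("你好"))   # ['n', 'i', 'h', 'ao']
```

The returned units index the mouth-shape expressions presented in the UI window.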
  • The time length of the target call voice may be calculated first, and each frame of the target media display image is then played continuously according to that length, with the displayed image switching as playback proceeds. The playing time of one frame is obtained by dividing the length of the target call voice by the number of frames of the target media display image. Taking "Hello" as an example, if the voice "hello" lasts 1 second and is decomposed into four expression images, each image is displayed for 1/4 second, that is, 250 milliseconds. In this way, the at least one frame of the target media display image matched to the target call voice is continuously switched and presented.
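The per-frame timing rule above is simple arithmetic; a minimal sketch (the function name is illustrative, not from the patent):

```python
def per_frame_ms(voice_duration_ms, frame_count):
    """Per-frame display time: voice length divided by the number of matched frames."""
    if frame_count <= 0:
        raise ValueError("need at least one frame")
    return voice_duration_ms / frame_count
```

For the "hello" example, a 1000 ms voice split across four mouth-shape frames yields 250 ms of display time per frame.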
  • The expression presented in the window corresponds to the received voice. The user's emoticon images are switched rapidly in the UI window, presenting a video-like effect and thereby realizing a virtual video call function.
  • Optionally, the step of playing and displaying the at least one frame of the target media display image through the display interface includes: playing and displaying, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture.
  • When playing and displaying the target media display image, a play background may be added. The background may be a fixed background associated with the set contact, or a corresponding image may be matched and acquired as the background image according to a keyword or word recognized in the target call voice, so that the background image changes with the user's voice content.
  • This process mainly performs voice matching and synthesizes known scene images (background images) from the library without requiring a wireless network environment, thereby realizing virtual scene reproduction in a network-free environment. The continuously switched scenes and expressions realize a virtual video dialogue, making the conversation more effective and interesting.
  • The core implementation process is as follows. For example, when a voice call is made on a mobile phone, the phones of both parties have pre-stored expression and scene images. When receiving the other party's voice, the local terminal first activates the speech recognition module and switches the corresponding media display resources according to the recognition result, displaying them in the terminal dialogue interface. As the voice continues, speech recognition continuously adapts the corresponding expression resources. The terminal interface can display a fixed background image or adapt a background image to the voice content; the expressions are matched with the background image in the terminal interface, and the visual effect of the fast switching is a realistic video dialogue scene, thereby realizing a virtual-reality video dialogue.
  • Optionally, before the step of collecting the target call voice, the method further includes obtaining an asset resource package and a personal image of the call contact, and integrating them into an emoticon package. The personal image may be a personal expression image or an image associated with the call contact. This process corresponds to the initialization of the emoticon package containing the media display images, which enables at least one frame of the target media display image to be acquired according to the target call voice. The asset resource package can be downloaded in advance while a network is available.
  • The process of integrating the personal image with each media material image may be implemented by, for example, image filling, partial replacement, or partial coverage.
  • The user installs the virtual simulation software on the mobile phone; the software can set the image scene arbitrarily, upload the user image, initialize the presets, and generate the user emoticon package. The user's mobile phone stores the required image resources, which may be captured by camera beforehand or downloaded from the network to the phone, and usually include the user's personal image, the material resource package, and the scene pictures of a typical video conversation.
  • The material resource package is, for example, a mouth-shape (lip) resource package, which is integrated by the software of the embodiment of the present invention and provided for the user's use. Taking the mouth resource package as an example, the software itself integrates the lip image resources. The user provides a personal image before use; the software is initialized, the lip images are combined with the personal image, and image optimization is performed to synthesize the lip emoticon package. An emoticon package corresponding to the user's pinyin alphabet pronunciations is thus generated. The image synthesis first clears the lip and edge regions of the user's face image, superimposes lip resources of the same size, and then optimizes the image to obtain a user-defined emoticon package corresponding to the pinyin alphabet.
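The clear-then-superimpose step can be sketched as a simple paste of a same-size lip patch into the face image. This is an illustrative stand-in (function name and list-of-lists grayscale representation are assumptions, not the patent's implementation):

```python
def paste_lip(face, lip, top, left):
    """Clear the mouth region of `face` and superimpose the same-size `lip`
    patch at (top, left); images are row-major grayscale pixel lists."""
    out = [row[:] for row in face]        # work on a copy, keep the original
    for r, lip_row in enumerate(lip):
        for c, value in enumerate(lip_row):
            out[top + r][left + c] = value  # overwrite the cleared region
    return out
```

A real pipeline would follow this with the smoothing/optimization pass described below so the pasted edges blend into the face.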
  • Optionally, the personal image includes a facial image of the person, and the media material images include pronunciation mouth-shape images corresponding to the initials and finals in pinyin, or pronunciation mouth-shape images corresponding to the vowels and consonants in English phonetic symbols.
  • The step of integrating the personal image of the call contact with each media material image to generate at least one media display image and obtain an emoticon package containing the at least one media display image includes: generating the media display image corresponding to each pronunciation mouth-shape image, and obtaining an emoticon package containing the media display images corresponding to the pronunciation mouth-shape images.
  • The technical implementation steps of the face-and-mouth synthesized expression package are as follows:
  • Step 1: Taking a call between two parties as an example, the voice communication device held by A stores the image resource of B in advance, together with the lip image resources corresponding to all letter pronunciations of the Chinese pinyin alphabet.
  • Step 2: Convert the face-and-mouth color image into a grayscale image using Gray = R*0.299 + G*0.587 + B*0.114. To avoid slow floating-point operations, an integer algorithm with rounding is introduced, giving the equivalent variant Gray = (R*30 + G*59 + B*11 + 50) / 100, which improves conversion efficiency.
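Both forms of the grayscale conversion in Step 2 can be written directly from the formulas above (function names are illustrative):

```python
def gray_float(r, g, b):
    """Floating-point luma: Gray = R*0.299 + G*0.587 + B*0.114."""
    return r * 0.299 + g * 0.587 + b * 0.114

def gray_int(r, g, b):
    """Integer variant from the text; the +50 term rounds to the nearest
    integer before the division by 100."""
    return (r * 30 + g * 59 + b * 11 + 50) // 100
```

For example, gray_float(100, 50, 200) is 82.05, and the integer variant rounds this to 82 without any floating-point arithmetic.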
  • Step 3: Take a grayscale threshold on the grayscale image and use it to segment the face, implementing mouth region detection. Perform edge detection on the face image and apply a mean-operator template (that is, each pixel value is recalculated from its adjacent pixels) to mean-process the grayscale image; feature areas of the face, such as the eye area, can thus be detected, or the mouth area can be identified from the symmetry and structural distribution of the face.
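The mean-operator template in Step 3 is a standard 3×3 mean filter. A minimal sketch (assuming grayscale images as row-major lists; borders are left unchanged for simplicity):

```python
def mean_filter(img):
    """3x3 mean operator: replace each interior pixel with the integer
    average of its 3x3 neighbourhood (including itself)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            total = sum(img[y + dy][x + dx]
                        for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = total // 9
    return out
```

Smoothing with this operator suppresses isolated noise before thresholding, which makes the segmented feature regions (eyes, mouth) more coherent.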
  • Step 4: Replace the original lip resource into the lip area detected in the face region to generate an expression. The pixel values of the lip image generated in Step 1 are sampled and quantized so that its number of pixels equals the number of pixels in the lip area of Step 3; the quantized lip resource is then replaced and filled into the lip area of Step 3, and the generated facial expression image is reconstructed.
  • Step 5: Perform Gaussian filtering on the newly generated expression image to enhance the smoothness of the synthesized image, and generate an expression resource library. Each expression image in the library is a pronunciation mouth shape of the face corresponding to a letter of the pinyin alphabet. The face and the different mouth shapes are combined to generate different expression images, each corresponding to the pronunciation of a pinyin letter. The newly generated facial expression image is denoised and smoothed by Gaussian filtering to obtain a clear expression image.
  • The template calculated directly from the Gaussian function contains floating-point numbers, so an integer 5×5 template operator with coefficient 1/273 is used, as shown in Figure 4.
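Figure 4 is not reproduced here, but the usual integer 5×5 Gaussian template whose coefficients sum to 273 (hence the 1/273 factor) is well known, and applying it can be sketched as follows (function name and list-based image representation are illustrative):

```python
# Standard integer 5x5 Gaussian template; the 25 coefficients sum to 273.
GAUSS_5X5 = [
    [1,  4,  7,  4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1,  4,  7,  4, 1],
]

def gaussian_5x5(img):
    """Smooth interior pixels with the 1/273 integer Gaussian template;
    a 2-pixel border is left unchanged for simplicity."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            acc = sum(GAUSS_5X5[dy + 2][dx + 2] * img[y + dy][x + dx]
                      for dy in range((-2), 3) for dx in range((-2), 3))
            out[y][x] = acc // 273
    return out
```

Because the weights are integers and the division by 273 happens once per pixel, the smoothing avoids per-tap floating-point multiplications, matching the efficiency goal stated above.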
  • In this process, at least one frame of the target media display image is matched according to the collected call voice, and the target media display images are played continuously to form a video-style playing effect, converting speech recognition into a video display. A local software application on the terminal device thus breaks through the limitation of the network environment and enables virtual video-telephone communication: the process does not depend on the network, saves traffic, and even removes the traffic restriction, so that the call is accompanied by virtualized video pictures and becomes more vivid and lively, enhancing the communication effect and adding fun to the conversation.
  • The embodiment of the present invention further discloses a media display terminal, which, as shown in FIG. 3, includes a collecting module 301, a first acquiring module 302, and a display module 303. The media display terminal may be a voice-capable terminal such as a smart watch or a mobile phone.
  • The collecting module 301 is configured to collect a target call voice.
  • The first acquiring module 302 is configured to acquire, according to the target call voice, at least one frame of a target media display image that matches the target call voice.
  • The display module 303 is configured to play and display the at least one frame of the target media display image through the display interface.
  • Optionally, the first acquiring module includes:
  • a first determining submodule, configured to determine a target emoticon package corresponding to the target call voice;
  • a second determining submodule, configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice; and
  • an obtaining submodule, configured to match, from the target emoticon package, at least one frame of the target media display image containing the target pronunciation mouth shape.
  • Optionally, the second determining submodule includes:
  • a first determining unit, configured to determine, according to the voiceprint feature of the target call voice, the text content corresponding to the target call voice;
  • a first matching unit, configured to match, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcription;
  • a second matching unit, configured to match, according to the phonetic combination, at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcription and pronunciation mouth shapes; and
  • a second determining unit, configured to determine that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
  • Optionally, the correspondence table between text and phonetic transcription includes a correspondence table between Chinese characters and pinyin, and the first matching unit includes a first matching subunit configured to match, according to the text content, the pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
  • The correspondence table between phonetic transcription and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit includes a second matching subunit configured to match, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • Optionally, the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals in pinyin and the pronunciation mouth shapes; the second matching subunit is configured to determine the initials and finals (or only the finals) contained in the pinyin combination and match the corresponding at least one pronunciation mouth shape.
  • Optionally, the correspondence table between text and phonetic transcription includes a correspondence table between English and English phonetic symbols, and the first matching unit includes a third matching subunit configured to match, according to the text content, the phonetic symbol combination corresponding to the text content from the correspondence table between English and English phonetic symbols.
  • The correspondence table between phonetic transcription and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit includes a fourth matching subunit configured to match, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • Optionally, the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants in the English phonetic symbols and the pronunciation mouth shapes; the fourth matching subunit is configured to determine the vowels and consonants (or only the vowels) contained in the phonetic symbol combination and match the corresponding at least one pronunciation mouth shape.
  • Optionally, the first determining submodule includes:
  • a third determining unit, configured to determine the target contact corresponding to the target call voice; and
  • a retrieving unit, configured to retrieve a target emoticon package pre-associated with the target contact.
  • Optionally, the collecting module includes:
  • a monitoring submodule, configured to monitor the voice call process; and
  • a third determining submodule, configured to determine that the received voice of the other party is the target call voice.
  • Optionally, the terminal further includes:
  • a second obtaining module, configured to obtain an asset resource package and a personal image of a call contact, wherein the asset resource package contains at least one media material image; and
  • a generating module, configured to integrate the personal image of the call contact with each media material image to generate at least one media display image, obtaining an emoticon package containing the at least one media display image.
  • Optionally, the personal image includes a facial image of a person, and the media material images include pronunciation mouth-shape images corresponding to the initials and finals in pinyin, or pronunciation mouth-shape images corresponding to the vowels and consonants in English phonetic symbols.
  • Optionally, the generating module includes:
  • an identification submodule, configured to identify the mouth region in the facial image;
  • a replacement submodule, configured to fill and replace the pronunciation mouth-shape images of the media material images into the mouth region; and
  • a generating submodule, configured to generate a media display image corresponding to each pronunciation mouth-shape image in the media material images, obtaining an emoticon package containing the media display images corresponding to the pronunciation mouth-shape images.
  • Optionally, the display module includes a display submodule, configured to play and display, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture.
  • The media display terminal matches at least one frame of the target media display image according to the collected call voice and plays the target media display images continuously, forming a video-style playing effect and converting speech recognition into a video display. Virtual video-telephone communication can thus break through the limitation of the network environment: the process does not depend on the network, saves traffic, and even removes the traffic restriction, so that the call is accompanied by virtualized video pictures and becomes more vivid and lively, enhancing communication and adding fun to the conversation.
  • The embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in the foregoing embodiments.
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium.
  • Relational terms such as "first" and "second" are merely used to distinguish one entity or operation from another, without necessarily requiring or implying any such actual relationship or order between these entities or operations. The terms "comprises", "comprising", or any other variation are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element.
  • The embodiment of the present invention breaks through the limitation of the network environment and performs virtual video-telephone communication. The process does not depend on the network, saves traffic, and even removes the traffic restriction, so that the call is accompanied by virtualized video pictures and becomes more vivid and lively; the communication effect is strengthened, fun is added to the conversation, and the user experience is improved.

Abstract

A medium displaying method and terminal. The method comprises: acquiring a target call voice (101); acquiring, according to the target call voice, at least one target medium display image frame matching the target call voice (102); and displaying, via a display interface, the at least one target medium display image frame (103).

Description

Media Display Method and Terminal

Technical Field

This document relates to, but is not limited to, the field of communication technologies, and in particular to a media display method and terminal.

Background

During a phone call, a traditional feature phone can only communicate by voice; a traditional call has no concrete video scene, and voice communication is far less vivid, concrete, and memorable than video communication. A mobile phone can enter a network environment through a SIM card or wireless WiFi to realize a video call.

Summary of the Invention
The following is an overview of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

The embodiments of the present invention provide a media display method and terminal, adopting the following technical solutions.

In one aspect, an embodiment of the present invention provides a media display method, including:

collecting a target call voice;

acquiring, according to the target call voice, at least one frame of a target media display image matching the target call voice; and

playing and displaying the at least one frame of the target media display image through a display interface.
Optionally, the acquiring, according to the target call voice, of at least one frame of a target media display image matching the target call voice includes:

determining a target emoticon package corresponding to the target call voice;

determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice; and

matching, from the target emoticon package, at least one frame of the target media display image containing the target pronunciation mouth shape.
Optionally, the determining, according to the voiceprint feature of the target call voice, of at least one target pronunciation mouth shape required to produce the target call voice includes:

determining, according to the voiceprint feature of the target call voice, the text content corresponding to the target call voice;

matching, according to the text content, the corresponding pinyin combination from a correspondence table between text and pinyin;

matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from a correspondence table between pinyin and pronunciation mouth shapes; and

determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
Optionally, the determining, according to the voiceprint feature of the target call voice, of at least one target pronunciation mouth shape required to produce the target call voice includes:

determining, according to the voiceprint feature of the target call voice, the text content corresponding to the target call voice;

matching, according to the text content, the phonetic combination corresponding to the text content from a correspondence table between text and phonetic transcription;

matching, according to the phonetic combination, at least one corresponding pronunciation mouth shape from a correspondence table between phonetic transcription and pronunciation mouth shapes; and

determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to produce the target call voice.
Optionally, the correspondence table between text and phonetic transcription includes a correspondence table between Chinese characters and pinyin, and the matching, according to the text content, of the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcription includes:

matching, according to the text content, the pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.

The correspondence table between phonetic transcription and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the matching, according to the phonetic combination, of at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcription and pronunciation mouth shapes includes:

matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
Optionally, the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals in pinyin and the pronunciation mouth shapes, and the matching, according to the pinyin combination, of at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes includes:

determining, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determining the finals contained in the pinyin combination; and

matching, according to the initials and finals, or according to the finals, the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and the pronunciation mouth shapes.
Optionally, the correspondence table between text and phonetic transcription includes a correspondence table between English and English phonetic symbols, and the matching, according to the text content, of the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcription includes:

matching, according to the text content, the phonetic symbol combination corresponding to the text content from the correspondence table between English and English phonetic symbols.

The correspondence table between phonetic transcription and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the matching, according to the phonetic combination, of at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcription and pronunciation mouth shapes includes:

matching, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
Optionally, the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants in the English phonetic symbols and the pronunciation mouth shapes, and the matching, according to the phonetic symbol combination, of at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:

determining, according to the phonetic symbol combination, the vowels and consonants contained in the phonetic symbol combination, or determining the vowels contained in the phonetic symbol combination; and

matching, according to the vowels and consonants, or according to the vowels, the corresponding at least one pronunciation mouth shape from the correspondences between the vowels and consonants and the pronunciation mouth shapes.
Optionally, the determining of the target emoticon package corresponding to the target call voice includes:

determining the target contact corresponding to the target call voice; and

retrieving a target emoticon package pre-associated with the target contact.

Optionally, the collecting of the target call voice includes:

monitoring the voice call process; and

determining that the received voice of the other party is the target call voice.
Optionally, before the step of collecting the target call voice, the method further includes:

obtaining an asset resource package and a personal image of a call contact, wherein the asset resource package contains at least one media material image; and

integrating the personal image of the call contact with each media material image to generate at least one media display image, obtaining an emoticon package containing the at least one media display image.
Optionally, the personal image includes a facial image of a person, and the media material images include pronunciation mouth-shape images corresponding to the initials and finals in pinyin, or pronunciation mouth-shape images corresponding to the vowels and consonants in English phonetic symbols; the integrating of the personal image of the call contact with each media material image to generate at least one media display image and obtain an emoticon package containing the at least one media display image includes:

identifying the mouth region in the facial image;

filling and replacing the pronunciation mouth-shape images of the media material images into the mouth region; and

generating the media display image corresponding to each pronunciation mouth-shape image in the media material images, obtaining an emoticon package containing the media display images corresponding to the pronunciation mouth-shape images.
Optionally, the playing and displaying of the at least one frame of the target media display image through the display interface includes:

playing and displaying, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture.
In another aspect, an embodiment of the present invention further provides a media display terminal, including:

a collecting module, configured to collect a target call voice;

a first acquiring module, configured to acquire, according to the target call voice, at least one frame of a target media display image matching the target call voice; and

a display module, configured to play and display the at least one frame of the target media display image through a display interface.
Optionally, the first acquiring module includes:

a first determining submodule, configured to determine a target emoticon package corresponding to the target call voice;

a second determining submodule, configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to produce the target call voice; and

an obtaining submodule, configured to match, from the target emoticon package, at least one frame of the target media display image containing the target pronunciation mouth shape.
Optionally, the second determining submodule includes:
a first determining unit, configured to determine, according to the voiceprint feature of the target call voice, text content corresponding to the target call voice;
a first matching unit, configured to match, according to the text content, a corresponding phonetic combination from a text-to-phonetic-notation correspondence table;
a second matching unit, configured to match, according to the phonetic combination, at least one corresponding pronunciation mouth shape from a phonetic-notation-to-mouth-shape correspondence table;
a second determining unit, configured to determine that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required for uttering the target call voice.
Optionally, the text-to-phonetic-notation correspondence table includes a correspondence table between Chinese characters and pinyin; the first matching unit includes:
a first matching subunit, configured to match, according to the text content, a pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin;
the phonetic-notation-to-mouth-shape correspondence table includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit includes:
a second matching subunit, configured to match, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
Optionally, the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between pinyin initials and finals and pronunciation mouth shapes; the second matching subunit is configured to:
determine, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determine the finals contained in the pinyin combination;
match, according to the initials and finals, or according to the finals, at least one corresponding pronunciation mouth shape from the correspondences between the initials and finals and pronunciation mouth shapes.
Optionally, the text-to-phonetic-notation correspondence table includes a correspondence table between English text and English phonetic symbols; the first matching unit includes:
a third matching subunit, configured to match, according to the text content, a phonetic-symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols;
the phonetic-notation-to-mouth-shape correspondence table includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit includes:
a fourth matching subunit, configured to match, according to the phonetic-symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
Optionally, the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between vowels and consonants in the English phonetic symbols and pronunciation mouth shapes; the fourth matching subunit is configured to:
determine, according to the phonetic-symbol combination, the vowels and consonants contained in the phonetic-symbol combination, or determine the vowels contained in the phonetic-symbol combination;
match, according to the vowels and consonants, or according to the vowels, at least one corresponding pronunciation mouth shape from the correspondences between the vowels and consonants and pronunciation mouth shapes.
Optionally, the first determining submodule includes:
a third determining unit, configured to determine a target contact corresponding to the target call voice;
a retrieving unit, configured to retrieve a target emoticon package pre-associated with the target contact.
Optionally, the acquisition module (301) includes:
a monitoring submodule, configured to monitor a voice call process;
a third determining submodule, configured to determine that the received voice of the other party is the target call voice.
Optionally, the media display terminal further includes:
a second acquiring module, configured to acquire a material resource package and a personal image of a call contact, where the material resource package contains at least one media material image;
a generating module, configured to integrate the personal image of the call contact with each media material image to generate at least one media display image, obtaining an emoticon package containing the at least one media display image.
Optionally, the personal image includes a facial image of the individual, and the media material images include pronunciation mouth-shape images corresponding to pinyin initials and finals, or pronunciation mouth-shape images corresponding to vowels and consonants in the English phonetic symbols; the generating module includes:
an identifying submodule, configured to identify a mouth region in the facial image;
a replacing module, configured to fill and replace the mouth region with the pronunciation mouth-shape images in the media material images;
a generating submodule, configured to generate a media display image corresponding to each pronunciation mouth-shape image in the media material images, obtaining an emoticon package containing the media display image corresponding to each pronunciation mouth-shape image.
Optionally, the display module includes:
a display submodule, configured to play and display the at least one frame of target media display image through the display interface, with an acquired display image as the background picture.
The embodiments of the present invention have the following beneficial effects:
In the embodiments of the present invention, at least one frame of target media display image is matched according to the acquired call voice, and these target media display images are played and displayed continuously to form a video-style playback effect, converting speech recognition into video display. Through a local software application mode on the terminal device, virtual video-call communication can be carried out without the constraints of the network environment; the process does not depend on the network, saving data traffic or even dispensing with traffic limits, so that the call is accompanied by a virtualized video picture, making the call more vivid and lively, enhancing the communication effect, adding fun to communication, and improving the user experience.
Other aspects will become apparent upon reading and understanding the accompanying drawings and the detailed description.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a media display method in an embodiment of the present invention;
FIG. 2 is a schematic flowchart of another media display method in an embodiment of the present invention;
FIG. 3 is a structural block diagram of a media display terminal in an embodiment of the present invention;
FIG. 4 is a schematic diagram of an operator template in an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Although a mobile phone can enter a network environment through a SIM card or wireless WiFi to make video calls, video calls usually depend on wireless networking, and in practice wireless network coverage is not available anytime and anywhere; making video calls over a cellular data connection is expensive. Because this approach depends on and is constrained by the network environment, video conversation cannot become the norm of communication, while voice-only exchange during calls is not flexible or vivid enough, leaving the user experience lacking.
An embodiment of the present invention discloses a media display method. As shown in FIG. 1, the method includes:
Step 101: acquire a target call voice.
This step may occur during a normal telephone call or during the playback of voice messages in instant messaging software such as WeChat. The target call voice may be the user's own voice in the call or the voice of the contact talking with the user; the target call voice becomes the object of the subsequent processing.
When applied to a normal telephone call, in one embodiment, the step of acquiring the target call voice may include: monitoring a voice call process; and determining that the received voice of the other party is the target call voice. That is, the call voice of the other contact in the call is acquired and subjected to the subsequent processing to realize a fun call.
Step 102: acquire, according to the target call voice, at least one frame of target media display image matching the target call voice.
The at least one frame of target media display image may be selected by matching from a preset group of image packages. It may match the content of the target call voice, or match the pitch, the volume, or the mouth shape required for uttering the target call voice. The target media display image may be a facial expression, a body movement, symbols of different heights, or a place and background environment corresponding to the voice content.
Step 103: play and display the at least one frame of target media display image through a display interface.
Correspondingly, during the continuous play and display of the at least one frame of target media display image, in addition to the facial expression changing with the voice, body movements may also be switched as the voice changes.
In this process, at least one frame of target media display image is matched according to the acquired call voice, and the target media display image is played and displayed to form a video-style playback effect, converting speech recognition into video display. Through a local software application mode on the terminal device, virtual video-call communication can be carried out without the constraints of the network environment; the process does not depend on the network, saving data traffic or even dispensing with traffic limits, so that the call is accompanied by a virtualized video picture, making the call more vivid and lively, enhancing the communication effect, adding fun to communication, and improving the user experience.
An embodiment of the present invention further discloses a media display method. As shown in FIG. 2, the method includes:
Step 201: acquire a target call voice.
When applied to a normal telephone call, in one embodiment, the step of acquiring the target call voice may include: monitoring a voice call process; and determining that the received voice of the other party is the target call voice. That is, the call voice of the other contact in the call is acquired and subjected to the subsequent processing to realize a fun call.
Step 202: determine a target emoticon package corresponding to the target call voice.
The target emoticon package may be a preset fixed emoticon package matching different target call voices, determined and read directly from a storage device; or it may be a targeted emoticon package that varies with specific elements of the target call voice, which needs to be determined by matching according to the target call voice.
In one embodiment, the step of determining the target emoticon package corresponding to the target call voice includes:
determining a target contact corresponding to the target call voice; and retrieving a target emoticon package pre-associated with the target contact. The emoticon package may be resource content pre-associated with the target contact. For example, when a known contact calls the user's mobile phone, the phone can learn that the target call voice comes from that contact and, based on the address book information, adapt the emoticon package corresponding to the calling contact, which may be a photo of the target contact or a specific picture. Making different settings for specific contacts makes the display results more targeted and more interesting.
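The per-contact pre-association described above can be modeled as a simple lookup with a default fallback. The following is a minimal sketch, not the patented implementation; the names and package identifiers are hypothetical:

```python
# Hypothetical sketch: map a caller (identified via the address book) to a
# pre-associated emoticon package, falling back to a default package.
DEFAULT_PACKAGE = "default_pack"

# Pre-associated packages, e.g. configured by the user per contact.
CONTACT_PACKAGES = {
    "Alice": "alice_photo_pack",
    "Bob": "bob_cartoon_pack",
}

def target_emoticon_package(contact_name):
    """Return the emoticon package pre-associated with the target contact."""
    return CONTACT_PACKAGES.get(contact_name, DEFAULT_PACKAGE)
```

A contact without a pre-associated package simply receives the default, so the display path never fails for unknown callers.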
Step 203: determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required for uttering the target call voice.
Different call voices correspond to different pronunciation mouth shapes, and different voices have different voiceprint feature information. Optionally, the required pronunciation mouth shapes may be determined according to the voiceprint features of the acquired target call voice.
In one embodiment, the step of determining, according to the voiceprint feature of the target call voice, the at least one target pronunciation mouth shape required for uttering the target call voice includes:
determining, according to the voiceprint feature of the target call voice, text content corresponding to the target call voice; matching, according to the text content, a phonetic combination corresponding to the text content from a text-to-phonetic-notation correspondence table; matching, according to the phonetic combination, at least one corresponding pronunciation mouth shape from a phonetic-notation-to-mouth-shape correspondence table; and determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required for uttering the target call voice.
Here, the phonetic notation is the annotated pronunciation of the text. Different languages correspond to different scripts, and different scripts correspond to different phonetic-notation systems: for example, when the text is Chinese, the corresponding phonetic notation is pinyin; when the text is English, the corresponding phonetic notation is the English phonetic symbols. The process of determining the corresponding pronunciation mouth shapes from the voiceprint features of the target call voice first converts the voice into text, then matches the text to the corresponding phonetic combination, and then matches the target pronunciation mouth shapes corresponding to that phonetic combination.
Optionally, this process is implemented as follows: receive the voice, quantize the voice data, and call an open-source interface to convert the voice into text. The principle is that different voices have different voiceprint feature information; voiceprint feature information and the corresponding text can be recorded in advance to build a database of correspondences, and after a new voice is captured it is compared against the pre-recorded database to find the corresponding text. The text-to-phonetic conversion works the same way: the correspondences between different phonetic notations and different texts are recorded in advance and written into an array to build the database. After the voice is converted into text, the text is looked up in the database to find its phonetic notation, and the target pronunciation mouth shape corresponding to that notation is then obtained from a preset notation-to-mouth-shape correspondence table in the database. The notation is split and matched to expression images, which are switched rapidly on display to produce a virtual video effect.
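The comparison of a newly captured voice against the pre-recorded voiceprint database can be sketched as a nearest-match lookup over quantized feature vectors. This is only an illustrative sketch under the assumption that the features are fixed-length numeric vectors; the feature values and texts below are invented for illustration:

```python
import math

# Hypothetical stand-in for the pre-recorded database described above:
# each entry pairs a quantized voiceprint feature vector with its text.
VOICEPRINT_DB = [
    ((0.9, 0.1, 0.3), "你"),
    ((0.2, 0.8, 0.5), "好"),
]

def match_text(features):
    """Return the text whose recorded voiceprint features are closest
    (by Euclidean distance) to the newly captured, quantized features."""
    def dist(entry):
        ref, _ = entry
        return math.dist(ref, features)
    _, text = min(VOICEPRINT_DB, key=dist)
    return text
```

In practice the open-source speech-recognition interface mentioned above would perform this step; the sketch only shows the compare-against-database principle.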
On the one hand, when the text content corresponding to the target call voice is determined to be Chinese characters, in one embodiment, the text-to-phonetic-notation correspondence table includes a correspondence table between Chinese characters and pinyin, and matching, according to the text content, the phonetic combination corresponding to the text content from the text-to-phonetic-notation correspondence table includes:
matching, according to the text content, a pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
Correspondingly, the phonetic-notation-to-mouth-shape correspondence table includes a correspondence table between pinyin and pronunciation mouth shapes, and matching, according to the phonetic combination, the at least one corresponding pronunciation mouth shape from the phonetic-notation-to-mouth-shape correspondence table includes: matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
The conversion from Chinese characters to pinyin works the same way: the correspondences between pinyin and Chinese characters are recorded in advance by reading the Chinese pinyin table from the standard GBK character set database and writing it into an array to build the database. After the voice is converted into Chinese characters, the characters are looked up in the database to find their pinyin, and the target pronunciation mouth shapes corresponding to the pinyin are then obtained from a preset pinyin-to-mouth-shape correspondence table in the database. The pinyin is split and matched to expression images, which are switched rapidly on display to produce a virtual video effect.
In one embodiment, the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between pinyin initials and finals and pronunciation mouth shapes, and matching, according to the pinyin combination, the at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes includes:
determining, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determining the finals contained in the pinyin combination; and matching, according to the initials and finals, or according to the finals, at least one corresponding pronunciation mouth shape from the correspondences between the initials and finals and pronunciation mouth shapes.
In this process, after the pinyin combination corresponding to the text content is matched from the correspondence table between Chinese characters and pinyin, the pinyin is located in the array structure and split into initials and finals. Since some Chinese pronunciations correspond to a final alone, the determined pinyin combination may contain a combination of initials and finals, or finals only. Expression images are matched according to the mouth shapes corresponding to the initials and finals and switched in the display window; the rapid switching of expressions produces a virtual video presentation. Optionally, the correspondence table between pinyin and pronunciation mouth shapes may also be generated by import or download at implementation time.
This solution takes making a call on a mobile phone to realize virtual video as an example. The storage device is provided with a mouth-shape resource library for the Chinese pinyin letters. In the pre-collected standard mouth-shape library, each mouth-shape image corresponds to a pronunciation example of a pinyin letter, and the mouth-shape diagrams correspond one-to-one with the initials and finals in the pinyin table, so that each pronunciation can be associated with the corresponding initial and final image combination in the library, making the displayed pronunciation mouth shape truly consistent with the mouth movements of the actual speaker.
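The splitting of a pinyin syllable into initial and final described above, including the finals-only case, can be sketched as follows. This is a minimal illustration, not the patented implementation; the image file names are hypothetical, and tone marks are ignored:

```python
# Hypothetical sketch: split a pinyin syllable into initial (shengmu) and
# final (yunmu). Syllables such as "an" have no initial, matching the
# "finals only" case described above.
INITIALS = ("zh", "ch", "sh",  # two-letter initials must be tried first
            "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_pinyin(syllable):
    """Return (initial, final); initial is '' when the syllable has none."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable

# Illustrative component-to-image table (file names invented).
MOUTH_SHAPES = {"n": "mouth_n.png", "i": "mouth_i.png",
                "h": "mouth_h.png", "ao": "mouth_ao.png", "an": "mouth_an.png"}

def mouth_images(syllable):
    """Map each non-empty pinyin component to its mouth-shape image."""
    return [MOUTH_SHAPES[p] for p in split_pinyin(syllable) if p]
```

Trying the two-letter initials zh/ch/sh before the single letters is the one ordering subtlety; otherwise "zhao" would wrongly split as ("z", "hao").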
On the other hand, when the text content corresponding to the target call voice is determined to be English, in one embodiment, the text-to-phonetic-notation correspondence table includes a correspondence table between English text and English phonetic symbols, and matching, according to the text content, the phonetic combination corresponding to the text content from the text-to-phonetic-notation correspondence table includes:
matching, according to the text content, a phonetic-symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols.
Correspondingly, the phonetic-notation-to-mouth-shape correspondence table includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, the at least one corresponding pronunciation mouth shape from the phonetic-notation-to-mouth-shape correspondence table includes: matching, according to the phonetic-symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
The conversion from English text to English phonetic symbols works the same way: the correspondences between English phonetic symbols and English text are recorded in advance and written into an array to build the database. After the voice is converted into English text, the text is looked up in the database to find its phonetic symbols, and the target pronunciation mouth shapes corresponding to the phonetic symbols are then obtained from a preset symbol-to-mouth-shape correspondence table in the database. The phonetic symbols are split and matched to expression images, which are switched rapidly on display to produce a virtual video effect.
In one embodiment, the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between vowels and consonants in the English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic-symbol combination, the at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:
determining, according to the phonetic-symbol combination, the vowels and consonants contained in the phonetic-symbol combination, or determining the vowels contained in the phonetic-symbol combination; and matching, according to the vowels and consonants, or according to the vowels, at least one corresponding pronunciation mouth shape from the correspondences between the vowels and consonants and pronunciation mouth shapes.
In this process, after the phonetic-symbol combination corresponding to the text content is matched from the correspondence table between English text and English phonetic symbols, the phonetic symbols are located in the array structure and split into vowels and consonants. Since some English pronunciations correspond to a vowel alone, the determined phonetic-symbol combination may contain a combination of vowels and consonants, or vowels only. Expression images are matched according to the mouth shapes corresponding to the vowels and consonants and switched in the display window; the rapid switching of expressions produces a virtual video presentation. Optionally, the correspondence table between English phonetic symbols and pronunciation mouth shapes may also be generated by import or download at implementation time.
This solution takes making a call on a mobile phone to realize virtual video as an example. The storage device is provided with a mouth-shape resource library for the English phonetic symbols. In the pre-collected standard mouth-shape library, each mouth-shape image corresponds to a pronunciation example of an English phonetic symbol, and the mouth-shape diagrams correspond one-to-one with the vowels and consonants in the English phonetic symbol table, so that each pronunciation can be associated with the corresponding vowel and consonant pronunciation images in the library, making the displayed pronunciation mouth shape truly consistent with the mouth movements of the actual speaker.
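For the English case described above, classifying a phonetic-symbol combination into vowels and consonants and mapping each symbol to a mouth-shape image can be sketched as follows. The vowel set uses common IPA symbols for English; the word, its transcription, and the file names are illustrative assumptions:

```python
# Hypothetical sketch: split a phonetic-symbol combination into vowels and
# consonants, then look up each symbol's mouth-shape image.
VOWELS = {"i:", "ɪ", "e", "æ", "ɑ:", "ɒ", "ɔ:", "ʊ", "u:", "ʌ", "ə", "ɜ:",
          "eɪ", "aɪ", "ɔɪ", "aʊ", "əʊ", "ɪə", "eə", "ʊə"}

def classify(phonemes):
    """Return (vowels, consonants) contained in the phonetic-symbol combination."""
    vowels = [p for p in phonemes if p in VOWELS]
    consonants = [p for p in phonemes if p not in VOWELS]
    return vowels, consonants

def mouth_images(phonemes, table):
    """Look up each phonetic symbol in a symbol-to-mouth-shape table."""
    return [table[p] for p in phonemes]
```

As with the pinyin case, a combination may contain vowels only, so `classify` may return an empty consonant list.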
Step 204: match, from the target emoticon package, at least one frame of target media display image containing the target pronunciation mouth shape.
Optionally, the target emoticon package is an emoticon package containing different mouth shapes, and the media display images contained in it are display images in which the mouth opens and closes to different degrees. Optionally, in addition to the target pronunciation mouth shape, a target media display image may contain other varying elements, such as changes of eyebrows, expressions, and face shapes, to be combined with the mouth shape.
Step 205: play and display the at least one frame of target media display image through the display interface.
In this process, while the two parties converse, the user's terminal, for example a mobile phone, adapts the associated expressions according to speech recognition; after synthesis and optimization of the corresponding image resources, the pictures play continuously along with the voice and the expressions keep switching and updating, so that a virtual video scene can be generated without a network environment, realizing a virtually displayed video conversation and making the exchange more effective and more fun.
下面对该过程进行描述。该语音转换文字拼音并适配表情的技术实现步骤如下:The process is described below. The technical implementation steps of the speech conversion text pinching and adapting the expression are as follows:
Step 1. Read the Chinese-character pinyin content from the standard GBK (Chinese character encoding) character-set database in advance and write the pinyin data into an array. Each array element is a structure with four fields: the pinyin, the initial obtained by splitting the pinyin, the final, and the Chinese character corresponding to the pinyin.
Step 2. Convert the speech to text through an open-source interface, look up the text as a key in the array to find the corresponding pinyin, and then split the pinyin within the corresponding structure to obtain the associated initial and final.
Step 3. Look up the expressions associated with the initials and finals, create a visual dialog box on the desktop as a user interface (UI) window, and render the expressions in the UI window.
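The three steps above can be sketched in Python as follows. This is an illustrative sketch only, not part of the patent disclosure: the table contents and function names are hypothetical stand-ins for the GBK-derived pinyin array and the lip-shape lookup keys.

```python
# Initials list: two-letter initials must be tested before single letters.
PINYIN_INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
    "j", "q", "x", "r", "z", "c", "s", "y", "w",
)

def split_pinyin(pinyin):
    """Split a toneless pinyin syllable into (initial, final)."""
    for ini in PINYIN_INITIALS:
        if pinyin.startswith(ini):
            return ini, pinyin[len(ini):]
    return "", pinyin  # zero-initial syllable, e.g. "an"

# Tiny hand-made stand-in for the array read from the GBK database in Step 1;
# each entry mirrors the four-field struct (character -> pinyin, split below).
HANZI_TABLE = {
    "你": "ni",
    "好": "hao",
}

def text_to_mouth_keys(text):
    """Steps 2-3: map recognized text to the (initial, final) keys used to
    look up the associated lip-shape expressions."""
    keys = []
    for ch in text:
        pinyin = HANZI_TABLE[ch]
        keys.append(split_pinyin(pinyin))
    return keys

print(text_to_mouth_keys("你好"))  # [('n', 'i'), ('h', 'ao')]
```

The (initial, final) pairs produced here are then used to select expression images for display in the UI window.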
When the at least one frame of target media display image is played and displayed continuously, the presentation time can be determined by first computing the duration of the target call speech and, from that duration, deriving the display time of each frame of target media display image; continuous playback then proceeds according to that display time. Optionally, dividing the duration of the target call speech by the number of frames of target media display images yields the display time of one frame. Taking "你好" (hello) as an example, suppose the speech "你好" lasts 1 second and decomposes into expressions corresponding to 4 images; each image is then displayed for 1/4 second, i.e. 250 milliseconds. The at least one frame of target media display image matched to the target call speech is switched and presented continuously in the order in which the target call speech was received. The expressions shown in the window thus correspond to the received speech; during the call, as the speech keeps changing, the user's emoticon images switch rapidly in the UI window, producing a video effect and thereby realizing a virtual video-call function.
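The per-frame display time described above (speech duration divided by frame count, e.g. 1 second over 4 frames giving 250 ms for "你好") can be expressed as a small helper. This is an illustrative sketch, not part of the patent disclosure:

```python
def frame_duration_ms(speech_ms, frame_count):
    """Display time of each lip-shape frame: total speech duration
    divided evenly over the matched frames."""
    if frame_count <= 0:
        raise ValueError("need at least one frame")
    return speech_ms / frame_count

# "你好": 1 second of speech decomposed into 4 expression images.
print(frame_duration_ms(1000, 4))  # 250.0 ms per image
```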
As an embodiment, the step of playing and displaying the at least one frame of target media display image through the display interface includes: playing and displaying, through the display interface, the at least one frame of target media display image with an obtained display image as the background picture.
When the target media display images are played, a playback background may be added. The background may be a preset fixed background corresponding to the contact, or a matching image obtained according to keywords or words recognized in the target call speech, so that the background picture changes with the content of the user's speech.
This process mainly performs, without a wireless network, lip-shape lookup and matching through speech recognition, and composites the result with the known scene images (background pictures) in the library, thereby reproducing a virtual scene in a network-free environment. With continuously switching scenes and expressions, a virtual video-call display can be realized even offline, making conversation more effective and entertaining.
Optionally, the core implementation flow is as follows. For example, during a voice call on mobile phones, both parties' phones hold pre-stored expressions and scene images. When the other party's speech is received, the local speech-recognition module is started first, and the corresponding media display resources are switched according to the recognition result and shown in the terminal's call interface. As the speech changes, speech recognition continuously matches the corresponding expression resources; the terminal interface may display a fixed background picture or match the background to the speech content, and the expressions switch against the background picture in the terminal interface. The visual effect of this rapid switching is a realistic video-call scene, so a virtual-reality video call is realized locally.
作为一实施方式,其中,采集目标通话语音的步骤之前,还包括:As an embodiment, before the step of collecting the target call voice, the method further includes:
Obtaining a material resource package and a personal image of the call contact, where the material resource package contains at least one media material image; and integrating the personal image of the call contact with each media material image to generate at least one media display image, obtaining an emoticon package containing the at least one media display image.
The personal image may be a personal expression image, or an image associated with and related to the call contact. This process corresponds to the initialization of the emoticon package containing the media display images, so that at least one matching frame of target media display image can be obtained according to the target call speech. The material resource package may be downloaded in advance while a network is available. The integration of the personal image with each media material image may be implemented by methods such as matte filling, partial replacement, or partial overlay.
Optionally, for example, the user installs virtual-simulation software on the phone; the software can set arbitrary image scenes, upload user images, initialize presets, and generate the user's emoticon package. The user's phone first stores the required image resources, which may be photographed in advance or downloaded over the network to the phone, and which usually include the user's personal image, the material resource package, and scene pictures of typical video calls. The material resource package is, for example, a lip-shape resource package, which is integrated into the software of this embodiment of the invention for the user's use. Taking the lip-shape resource package as an example, the software itself integrates the lip-shape image resources; before use, the user provides a personal image and initializes the software, which composites the lip-shape images with the personal image, optimizes the result, and merges the lip-shape package into the user's own image, generating an emoticon package in which the user's face corresponds to the letters of the pinyin chart. The image-compositing technique first clears the mouth and its edge region in the user's face image, overlays a lip-shape resource of the same size, and then optimizes the image, obtaining a user-customized emoticon package corresponding to the pinyin alphabet.
As an embodiment, the personal image includes a facial image of the person, and the media material images include pronunciation lip-shape images corresponding to the initials and finals of pinyin, or pronunciation lip-shape images corresponding to the vowels and consonants of the English phonetic alphabet. The step of integrating the personal image of the call contact with each media material image to generate at least one media display image and obtain an emoticon package containing the at least one media display image includes:
Identifying the mouth region in the facial image; filling and replacing the mouth region with the pronunciation lip-shape images from the media material images; and generating a media display image corresponding to each pronunciation lip-shape image in the media material images, obtaining an emoticon package containing a media display image corresponding to each pronunciation lip-shape image.
Taking media material images that include pronunciation lip-shape images for the initials and finals of pinyin as an example, the face and lip-shape emoticon-package synthesis is implemented in the following steps:
Step 1. Taking a call between parties A and B as an example, the voice communication device held by A stores B's image resources in advance, together with lip-shape image resources for the pronunciation of every letter of the Chinese pinyin alphabet.
Step 2. Convert the face and lip-shape color images to grayscale. For color-to-gray conversion, the classic formula is Gray = R*0.299 + G*0.587 + B*0.114. To avoid slow floating-point arithmetic, an integer algorithm with rounding gives the equivalent variant Gray = (R*30 + G*59 + B*11 + 50)/100, improving conversion efficiency.
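The two grayscale formulas in Step 2 can be checked against each other in a few lines of Python. This sketch is illustrative only; it simply verifies that the integer variant tracks the floating-point formula to within one gray level:

```python
def gray_float(r, g, b):
    """Classic floating-point luminance formula."""
    return r * 0.299 + g * 0.587 + b * 0.114

def gray_int(r, g, b):
    """Integer variant from the text; the +50 implements rounding
    before the /100 truncation."""
    return (r * 30 + g * 59 + b * 11 + 50) // 100

for rgb in ((0, 0, 0), (255, 255, 255), (200, 120, 40)):
    f, i = gray_float(*rgb), gray_int(*rgb)
    assert abs(f - i) <= 1.0  # the two formulas agree within one level

print(gray_int(255, 255, 255))  # 255
```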
Step 3. Take a grayscale threshold and segment the face with it to detect the mouth region. Perform edge detection on the face image and process the grayscale image with a mean-operator template (each pixel value is updated to the mean of its neighboring pixels) to detect facial feature regions such as the eyes, mouth, and nose; alternatively, identify the mouth region from facial symmetry and structural distribution.
Step 4. Fill the mouth region detected in the face with the original lip-shape resources to generate expressions: quantize and sample each row of pixel values of the lip-shape image from Step 1 so that its pixel count matches that of the mouth region from Step 3, replace the mouth region of Step 3 with the resampled lip-shape resource, and reconstruct the facial expression image.
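The row-wise quantized sampling of Step 4 amounts to resampling each lip-image row to the pixel count of the detected mouth region. A minimal nearest-neighbour sketch (illustrative only; the patent does not specify the exact resampling method) is:

```python
def resample_row(row, target_len):
    """Stretch or shrink one row of pixel values to target_len samples
    using nearest-neighbour sampling."""
    if target_len <= 0:
        return []
    src_len = len(row)
    return [row[i * src_len // target_len] for i in range(target_len)]

# A 4-pixel lip row stretched to an 8-pixel mouth region.
print(resample_row([10, 20, 30, 40], 8))
# [10, 10, 20, 20, 30, 30, 40, 40]
```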
Step 5. Apply Gaussian filtering to the newly generated expression images to enhance the smoothness of the composites, and generate an expression resource library in which each expression image shows the face making the pronunciation lip shape of one letter of the alphabet.
Compositing the face with different lip shapes generates different expression images, each corresponding to the pronunciation lip-shape expression of one pinyin letter. The newly generated facial expression images need denoising and smoothing by Gaussian filtering to obtain clear expression images. The template computed from the Gaussian function contains floating-point numbers; to balance filtering quality against computational efficiency, an integer 5×5 template operator with coefficient 1/273 is used, as illustrated in the example of Figure 4.
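A commonly used integer 5×5 Gaussian kernel has entries summing to 273, matching the 1/273 coefficient mentioned above; since the patent's actual template in Figure 4 is not reproduced here, the kernel below is an assumption. The sketch applies it to one pixel neighbourhood:

```python
# Classic integer 5x5 Gaussian template; its 25 entries sum to 273.
KERNEL = [
    [1,  4,  7,  4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1,  4,  7,  4, 1],
]

def gaussian_at(image, x, y):
    """Filtered value at (x, y); image is a list of equal-length rows,
    and the 5x5 window must fit inside it."""
    acc = 0
    for dy in range(-2, 3):
        for dx in range(-2, 3):
            acc += KERNEL[dy + 2][dx + 2] * image[y + dy][x + dx]
    return acc // 273  # normalize by the kernel sum

flat = [[100] * 5 for _ in range(5)]
print(gaussian_at(flat, 2, 2))  # a constant patch stays at 100
```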
In this process, at least one frame of target media display image is matched from the collected call speech, and these target media display images are played and displayed continuously, producing a video-style playback effect and converting speech recognition into a video display. Through a local software application on the terminal device, virtual video-call communication can break free of network-environment constraints: the process does not depend on the network and saves, or even dispenses with, data traffic, so the call is accompanied by a virtualized video picture, making it more vivid and lively, strengthening the communication effect, and adding fun to communication.
An embodiment of the present invention further discloses a media display terminal which, as shown in FIG. 3, includes a collection module 301, a first acquisition module 302, and a display module 303. The media display terminal may be a voice-capable terminal such as a smart watch or a mobile phone.
采集模块301,设置为采集目标通话语音。The collecting module 301 is configured to collect a target call voice.
第一获取模块302,设置为根据所述目标通话语音,获取与所述目标通话语音匹配的至少一帧目标媒体显示图像。The first obtaining module 302 is configured to acquire, according to the target call voice, at least one frame target media display image that matches the target call voice.
显示模块303,设置为通过显示界面对所述至少一帧目标媒体显示图像进行播放显示。The display module 303 is configured to perform play display on the at least one frame of the target media display image through the display interface.
The first acquisition module includes:
第一确定子模块,设置为确定与所述目标通话语音对应的目标表情包。The first determining submodule is configured to determine a target emoticon packet corresponding to the target call voice.
第二确定子模块,设置为依据所述目标通话语音的声纹特征,确定发出所述目标通话语音所需的至少一个目标发音口型。The second determining submodule is configured to determine at least one target pronunciation type required to issue the target call voice according to the voiceprint feature of the target call voice.
获取子模块,设置为从所述目标表情包中匹配得到包含所述目标发音口型的至少一帧目标媒体显示图像。Obtaining a sub-module, configured to obtain at least one frame of the target media display image including the target pronunciation lip shape from the target emoticon package.
其中,所述第二确定子模块,包括:The second determining submodule includes:
第一确定单元,设置为依据所述目标通话语音的声纹特征,确定与所述目标通话语音对应的文字内容。The first determining unit is configured to determine the text content corresponding to the target call voice according to the voiceprint feature of the target call voice.
第一匹配单元,设置为依据所述文字内容,从文字与注音的对应关系表中,匹配与所述文字内容对应的注音组合。The first matching unit is configured to match the phonetic combination corresponding to the text content from the correspondence table between the text and the phonetic according to the text content.
The second matching unit is configured to match, according to the phonetic combination, at least one corresponding pronunciation lip shape from the correspondence table between phonetics and pronunciation lip shapes.
第二确定单元,设置为确定所述至少一个发音口型为发出所述目标通话语音所需的至少一个目标发音口型。The second determining unit is configured to determine that the at least one pronunciation port type is at least one target pronunciation type required to issue the target call voice.
其中,所述文字与注音的对应关系表中包括汉字与拼音的对应关系表;所述第一匹配单元包括:The correspondence table between the characters and the phonetic includes a correspondence table between Chinese characters and pinyin; the first matching unit includes:
第一匹配子单元,设置为根据所述文字内容,从汉字与拼音的对应关系表中,匹配与所述文字内容对应的拼音组合。The first matching subunit is configured to match the pinyin combination corresponding to the text content from the correspondence table between the Chinese character and the pinyin according to the text content.
所述注音与发音口型的对应关系表中包括拼音与发音口型的对应关系表,所述第二匹配单元包括:The correspondence table between the phonetic transcription and the pronunciation mouth shape includes a correspondence table between the pinyin and the pronunciation mouth shape, and the second matching unit includes:
第二匹配子单元,设置为依据所述拼音组合,从拼音与发音口型的对应关系表中,匹配对应的至少一个发音口型。The second matching subunit is configured to match the corresponding at least one pronunciation lip shape from the correspondence table between the pinyin and the pronunciation lip shape according to the pinyin combination.
其中,所述拼音与发音口型的对应关系表中包括拼音中声母及韵母与发音口型的对应关系;所述第二匹配子单元是设置为:The correspondence table between the pinyin and the pronunciation mouth shape includes a correspondence between the initials and the finals in the pinyin and the pronunciation type; the second matching subunit is set as:
Determine, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determine the finals contained in the pinyin combination; and match, according to the initials and finals, or according to the finals, at least one corresponding pronunciation lip shape from the correspondence between initials and finals and pronunciation lip shapes.
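The matching this sub-unit performs can be sketched as a table lookup from initials and finals to lip-shape images. The table contents and file names below are hypothetical placeholders, not the patent's actual lip-shape resource library:

```python
# Hypothetical correspondence table: pinyin initials/finals -> lip-shape images.
MOUTH_SHAPES = {
    "n": "mouth_n.png",
    "i": "mouth_i.png",
    "h": "mouth_h.png",
    "ao": "mouth_ao.png",
    "an": "mouth_an.png",
}

def shapes_for(parts):
    """parts: list of (initial, final) pairs; a zero-initial syllable has
    an empty initial, which is skipped so only the final is matched."""
    out = []
    for ini, fin in parts:
        if ini:
            out.append(MOUTH_SHAPES[ini])
        out.append(MOUTH_SHAPES[fin])
    return out

print(shapes_for([("n", "i"), ("h", "ao")]))
# ['mouth_n.png', 'mouth_i.png', 'mouth_h.png', 'mouth_ao.png']
```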
其中,所述文字与注音的对应关系表中包括英文与英语音标的对应关系表;所述第一匹配单元包括:The correspondence table between the text and the phonetic includes a correspondence table between the English and the English phonetic symbols; the first matching unit includes:
第三匹配子单元,设置为根据所述文字内容,从英文与英语音标的对应关系表中,匹配与所述文字内容对应的音标组合。The third matching subunit is configured to match the phonetic symbol corresponding to the text content from the correspondence table between the English and the English phonetic symbols according to the text content.
所述注音与发音口型的对应关系表中包括英语音标与发音口型的对应关系表,所述第二匹配单元包括:The correspondence table between the phonetic transcription and the pronunciation mouth shape includes a correspondence table between the English phonetic symbols and the pronunciation mouth shape, and the second matching unit includes:
第四匹配子单元,设置为依据所述音标组合,从英语音标与发音口型的对应关系表中,匹配对应的至少一个发音口型。The fourth matching subunit is configured to match the corresponding at least one pronunciation port type from the correspondence table between the English phonetic symbols and the pronunciation mouth shape according to the phonetic symbol combination.
其中,所述英语音标与发音口型的对应关系表中包括英语音标中元音及辅音与发音口型的对应关系;所述第四匹配子单元是设置为: The correspondence table between the English phonetic symbols and the pronunciation mouth shape includes a correspondence relationship between the vowels in the English phonetic symbols and the consonants and the pronunciation mouth shape; the fourth matching subunit is set as:
Determine, according to the phonetic-symbol combination, the vowels and consonants contained in the phonetic-symbol combination, or determine the vowels contained in the phonetic-symbol combination; and match, according to the vowels and consonants, or according to the vowels, at least one corresponding pronunciation lip shape from the correspondence between vowels and consonants and pronunciation lip shapes.
其中,所述第一确定子模块,包括:The first determining submodule includes:
第三确定单元,设置为确定所述目标通话语音所对应的目标联系人。The third determining unit is configured to determine a target contact corresponding to the target call voice.
调取单元,设置为调取与所述目标联系人预关联的目标表情包。The retrieval unit is configured to retrieve a target expression package pre-associated with the target contact.
其中,所述采集模块,包括:The collection module includes:
监听子模块,设置为监听语音通话进程。The monitor submodule is set to listen to the voice call process.
第三确定子模块,设置为确定接收到的对方通话语音为所述目标通话语音。The third determining submodule is configured to determine that the received counterpart voice is the target call voice.
其中,该终端还包括:The terminal further includes:
第二获取模块,设置为获取素材资源包及通话联系人的个人图像,其中,所述素材资源包中包含至少一个媒体素材图像。The second obtaining module is configured to obtain a personal image of the asset resource package and the call contact, wherein the asset resource package includes at least one media material image.
生成模块,设置为将所述通话联系人的个人图像与每一所述媒体素材图像进行整合,生成至少一个媒体显示图像,得到包含所述至少一个媒体显示图像的表情包。And a generating module, configured to integrate the personal image of the call contact with each of the media material images to generate at least one media display image, to obtain an emoticon package including the at least one media display image.
其中,所述个人图像中包括个人的脸部图像,所述媒体素材图像中包括拼音中声母及韵母所对应的发音口型图像或英语音标中元音及辅音所对应的发音口型图像,所述生成模块,包括:Wherein, the personal image includes a facial image of a person, and the image of the media material includes a pronunciation mouth shape image corresponding to the initial and the final voice in the pinyin, or a pronunciation mouth image corresponding to the vowel and the consonant in the English phonetic symbol. The generation module includes:
识别子模块,设置为识别所述脸部图像中的嘴部区域。An identification sub-module is provided to identify a mouth region in the facial image.
The replacement module is configured to fill and replace the mouth region with the pronunciation lip-shape images from the media material images.
生成子模块,设置为生成与所述媒体素材图像中的每一所述发音口型图像对应的媒体显示图像,得到包含与每一所述发音口型图像对应的媒体显示图像的表情包。And generating a sub-module, configured to generate a media display image corresponding to each of the vocal-mouth images in the media material image, to obtain an expression package including a media display image corresponding to each of the vocal-mouth images.
其中,所述显示模块,包括: The display module includes:
显示子模块,设置为通过所述显示界面,以获取得到的一显示图像为背景画面,对所述至少一帧目标媒体显示图像进行播放显示。The display sub-module is configured to obtain, by using the display interface, a obtained display image as a background image, and play and display the at least one frame of the target media display image.
According to the collected call speech, the media display terminal matches at least one frame of target media display image and plays and displays these target media display images continuously, producing a video-style playback effect and converting speech recognition into a video display. Through a local software application on the terminal device, virtual video-call communication can break free of network-environment constraints: the process does not depend on the network and saves, or even dispenses with, data traffic, so the call is accompanied by a virtualized video picture, making it more vivid and lively, strengthening the communication effect, and adding fun to communication.
本发明实施例还提供了一种计算机可读存储介质,存储有计算机可执行指令,所述计算机可执行指令被处理器执行时实现上述实施例所述的方法。The embodiment of the invention further provides a computer readable storage medium storing computer executable instructions, which are implemented by the processor to implement the method described in the foregoing embodiments.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and apparatuses, may be implemented as software, firmware, hardware, and appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical units; for example, one physical component may have multiple functions, or one function or step may be performed jointly by several physical components. Some or all of the components may be implemented as software executed by a processor such as a digital signal processor or microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Moreover, it is well known to those skilled in the art that communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
尽管已描述了本发明实施例的可选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括可选实施例以及落入本发明实施例范围的所有变更和修改。Although alternative embodiments of the embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to the embodiments once they are aware of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including all alternatives and modifications of the embodiments of the invention.
最后,还需要说明的是,在本发明实施例中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。Finally, it should also be noted that in the embodiments of the present invention, relational terms such as first and second, etc. are merely used to distinguish one entity or operation from another entity or operation, without necessarily requiring or Imply that there is any such actual relationship or order between these entities or operations. Furthermore, the terms "comprises" or "comprising" or "comprising" or any other variations are intended to encompass a non-exclusive inclusion, such that a process, method, article, or terminal device that includes a plurality of elements includes not only those elements but also Other elements that are included, or include elements inherent to such a process, method, article, or terminal device. An element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises the element, without further limitation.
The above are optional embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and refinements without departing from the principles described herein, and these improvements and refinements also fall within the protection scope of the present invention.
工业实用性Industrial applicability
The embodiments of the present invention break through network-environment constraints to enable virtual video-call communication. The process does not depend on the network and saves, or even dispenses with, data traffic, so the call is accompanied by a virtualized video picture, making it more vivid and lively, strengthening the communication effect, adding fun to communication, and improving the user experience.

Claims (24)

  1. 一种媒体显示方法,包括:A media display method includes:
    采集目标通话语音(101);Collecting target call voice (101);
    根据所述目标通话语音,获取与所述目标通话语音匹配的至少一帧目标媒体显示图像(102);Obtaining at least one frame target media display image that matches the target call voice according to the target call voice (102);
    通过显示界面对所述至少一帧目标媒体显示图像进行播放显示(103)。The at least one frame of the target media display image is played and displayed through the display interface (103).
  2. 根据权利要求1所述的方法,其中,所述根据所述目标通话语音,获取与所述目标通话语音匹配的至少一帧目标媒体显示图像(102),包括:The method of claim 1, wherein the acquiring at least one frame of the target media display image (102) that matches the target call voice according to the target call voice comprises:
    确定与所述目标通话语音对应的目标表情包(202);Determining a target emoticon packet corresponding to the target call voice (202);
    依据所述目标通话语音的声纹特征,确定发出所述目标通话语音所需的至少一个目标发音口型(203);Determining at least one target pronunciation type (203) required to issue the target call voice according to the voiceprint feature of the target call voice;
    matching, from the target emoticon package, at least one frame of target media display image containing the target pronunciation lip shape (204).
  3. 根据权利要求2所述的方法,其中,所述依据所述目标通话语音的声纹特征,确定发出所述目标通话语音所需的至少一个目标发音口型,包括:The method according to claim 2, wherein the determining, according to the voiceprint feature of the target call voice, the at least one target voice vocal type required to issue the target call voice comprises:
    依据所述目标通话语音的声纹特征,确定与所述目标通话语音对应的文字内容;Determining text content corresponding to the target call voice according to the voiceprint feature of the target call voice;
    依据所述文字内容,从文字与注音的对应关系表中,匹配与所述文字内容对应的注音组合;And matching, according to the text content, a phonetic combination corresponding to the text content from a correspondence table between text and phonetic;
    依据所述注音组合,从注音与发音口型的对应关系表中,匹配对应的至少一个发音口型;According to the phonetic combination, at least one corresponding pronunciation type is matched from the correspondence table between the phonetic and the pronunciation mouth type;
    确定所述至少一个发音口型为发出所述目标通话语音所需的至少一个目标发音口型。Determining the at least one pronunciation port type is at least one target pronunciation port type required to issue the target call voice.
  4. The method according to claim 3, wherein the correspondence table between text and phonetics includes a correspondence table between Chinese characters and pinyin; and the matching, according to the text content, of a phonetic combination corresponding to the text content from the correspondence table between text and phonetics includes:
    matching, according to the text content, a pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin;
    所述注音与发音口型的对应关系表中包括拼音与发音口型的对应关系表,所述依据所述注音组合,从注音与发音口型的对应关系表中,匹配对应的至少一个发音口型,包括:The correspondence table between the phonetic transcription and the pronunciation mouth shape includes a correspondence table between the pinyin and the pronunciation mouth shape, and according to the phonetic combination, matching at least one pronunciation port from the correspondence table between the phonetic transcription and the pronunciation mouth shape Type, including:
    依据所述拼音组合,从拼音与发音口型的对应关系表中,匹配对应的至少一个发音口型。According to the pinyin combination, at least one corresponding pronunciation type is matched from the correspondence table between the pinyin and the pronunciation type.
  5. 根据权利要求4所述的方法，其中，所述拼音与发音口型的对应关系表中包括拼音中声母及韵母与发音口型的对应关系；所述依据所述拼音组合，从拼音与发音口型的对应关系表中，匹配对应的至少一个发音口型，包括：The method according to claim 4, wherein the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between pinyin initials and finals and pronunciation mouth shapes; and matching, according to the pinyin combination, the at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes comprises:
    根据所述拼音组合，确定所述拼音组合中所包含的声母和韵母，或者，确定所述拼音组合中所包含的韵母；Determining, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determining the finals contained in the pinyin combination;
    依据所述声母和韵母，或者，依据所述韵母，从所述声母及韵母与发音口型的对应关系中，匹配得到相对应的至少一个发音口型。Matching, according to the initials and finals, or according to the finals, the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and pronunciation mouth shapes.
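The initial/final decomposition of claim 5 can be sketched as below. The list is the standard set of Mandarin pinyin initials (with the two-letter initials zh/ch/sh tried first); the function and its return convention are illustrative assumptions, since the claim only requires that the initials and finals contained in the pinyin combination be identified.

```python
# Split a pinyin syllable into initial (shengmu) and final (yunmu).
# Two-letter initials are listed first so "zhang" matches "zh", not "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable: str) -> tuple:
    """Return (initial, final); the initial is '' for zero-initial syllables."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # zero-initial syllable, e.g. "an"
```

Each component would then be looked up in the claimed initial/final-to-mouth-shape correspondences; the "finals only" alternative of the claim simply ignores the first element of the tuple.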
  6. 根据权利要求3所述的方法，其中，所述文字与注音的对应关系表中包括英文与英语音标的对应关系表；所述依据所述文字内容，从文字与注音的对应关系表中，匹配与所述文字内容对应的注音组合，包括：The method according to claim 3, wherein the correspondence table between text and phonetic notation includes a correspondence table between English text and English phonetic symbols; and matching, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic notation comprises:
    根据所述文字内容，从英文与英语音标的对应关系表中，匹配与所述文字内容对应的音标组合；Matching, according to the text content, a phonetic symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols;
    所述注音与发音口型的对应关系表中包括英语音标与发音口型的对应关系表，所述依据所述注音组合，从注音与发音口型的对应关系表中，匹配对应的至少一个发音口型，包括：The correspondence table between phonetic notation and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, the at least one corresponding pronunciation mouth shape from the correspondence table between phonetic notation and pronunciation mouth shapes comprises:
    依据所述音标组合，从英语音标与发音口型的对应关系表中，匹配对应的至少一个发音口型。Matching, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  7. 根据权利要求6所述的方法，其中，所述英语音标与发音口型的对应关系表中包括英语音标中元音及辅音与发音口型的对应关系；所述依据所述音标组合，从英语音标与发音口型的对应关系表中，匹配对应的至少一个发音口型，包括：The method according to claim 6, wherein the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between vowels and consonants in the English phonetic symbols and pronunciation mouth shapes; and matching, according to the phonetic symbol combination, the at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes comprises:
    根据所述音标组合，确定所述音标组合中所包含的元音和辅音，或者，确定所述音标组合中所包含的元音；Determining, according to the phonetic symbol combination, the vowels and consonants contained in the phonetic symbol combination, or determining the vowels contained in the phonetic symbol combination;
    依据所述元音和辅音，或者，依据所述元音，从所述元音及辅音与发音口型的对应关系中，匹配得到相对应的至少一个发音口型。Matching, according to the vowels and consonants, or according to the vowels, the corresponding at least one pronunciation mouth shape from the correspondences between the vowels and consonants and pronunciation mouth shapes.
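Claim 7's vowel/consonant handling can be sketched the same way. The phoneme sets and viseme labels below are illustrative assumptions covering only a handful of English phonetic symbols, with the claim's "vowels only" alternative exposed as a flag.

```python
# Classify phonemes of an IPA-style transcription into vowels and consonants,
# then map each to a mouth-shape label. Sets and labels are placeholders,
# not a complete English inventory.
VOWELS = {"i:", "ɪ", "e", "æ", "ɑ:", "ɒ", "ʊ", "u:", "ə", "ʌ"}
VOWEL_VISEME = {"i:": "spread", "æ": "wide_open", "u:": "rounded"}
CONSONANT_VISEME = {"h": "open", "l": "tongue_up"}

def visemes(phonemes, vowels_only=False):
    """Mouth shapes for a phoneme sequence; optionally use vowels only."""
    out = []
    for p in phonemes:
        if p in VOWELS:
            out.append(VOWEL_VISEME.get(p, "neutral"))
        elif not vowels_only:
            out.append(CONSONANT_VISEME.get(p, "neutral"))
    return out
```

The vowels-only branch mirrors the claim's alternative of determining only the vowels contained in the phonetic symbol combination.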
  8. 根据权利要求2所述的方法，其中，所述确定与所述目标通话语音对应的目标表情包，包括：The method according to claim 2, wherein determining the target emoticon package corresponding to the target call voice comprises:
    确定所述目标通话语音所对应的目标联系人;Determining a target contact corresponding to the target call voice;
    调取与所述目标联系人预关联的目标表情包。Retrieving a target emoticon package pre-associated with the target contact.
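The contact-based retrieval of claim 8 reduces to a keyed lookup: resolve the caller to a contact, then fetch the emoticon package pre-associated with that contact. The table contents and the fallback to a default package are hypothetical.

```python
# Hypothetical contact -> pre-associated emoticon package table.
CONTACT_PACKAGES = {"Alice": "alice_pack", "Bob": "bob_pack"}

def target_package(contact: str, default: str = "default_pack") -> str:
    """Retrieve the emoticon package pre-associated with the target contact."""
    return CONTACT_PACKAGES.get(contact, default)
```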
  9. 根据权利要求1至8中任一项所述的方法，其中，所述采集目标通话语音，包括：The method according to any one of claims 1 to 8, wherein collecting the target call voice comprises:
    监听语音通话进程；Monitoring a voice call process;
    确定接收到的对方通话语音为所述目标通话语音。Determining the received call voice of the other party as the target call voice.
  10. 根据权利要求1-8任一项所述的方法，所述方法还包括：The method according to any one of claims 1 to 8, further comprising:
    所述采集目标通话语音之前，获取素材资源包及通话联系人的个人图像，其中，所述素材资源包中包含至少一个媒体素材图像；Before collecting the target call voice, acquiring a material resource package and a personal image of a call contact, wherein the material resource package includes at least one media material image;
    将所述通话联系人的个人图像与每一所述媒体素材图像进行整合，生成至少一个媒体显示图像，得到包含所述至少一个媒体显示图像的表情包。Integrating the personal image of the call contact with each of the media material images to generate at least one media display image, obtaining an emoticon package including the at least one media display image.
  11. 根据权利要求10所述的方法，其中，所述个人图像中包括个人的脸部图像，所述媒体素材图像中包括拼音中声母及韵母所对应的发音口型图像或英语音标中元音及辅音所对应的发音口型图像，所述将所述通话联系人的个人图像与每一所述媒体素材图像进行整合，生成至少一个媒体显示图像，得到包含所述至少一个媒体显示图像的表情包，包括：The method according to claim 10, wherein the personal image includes a facial image of a person, the media material images include pronunciation mouth shape images corresponding to the initials and finals in pinyin or to the vowels and consonants in the English phonetic symbols, and integrating the personal image of the call contact with each of the media material images to generate at least one media display image and obtain the emoticon package including the at least one media display image comprises:
    识别所述脸部图像中的嘴部区域；Identifying a mouth region in the facial image;
    将所述媒体素材图像中的发音口型图像在所述嘴部区域进行填充替换；Filling and replacing the mouth region with the pronunciation mouth shape images in the media material images;
    生成与所述媒体素材图像中的每一所述发音口型图像对应的媒体显示图像，得到包含与每一所述发音口型图像对应的媒体显示图像的表情包。Generating a media display image corresponding to each of the pronunciation mouth shape images in the media material images, and obtaining an emoticon package including the media display image corresponding to each of the pronunciation mouth shape images.
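The fill-and-replace step of claim 11 can be sketched on toy images represented as 2D lists: the identified mouth region is overwritten with each pronunciation mouth-shape patch, yielding one display frame per patch. The region coordinates are assumed to come from a separate face-recognition step; all names and data here are illustrative.

```python
def paste_mouth(face, patch, top, left):
    """Return a copy of `face` with `patch` filled into the mouth region."""
    out = [row[:] for row in face]          # do not mutate the original image
    for i, prow in enumerate(patch):
        for j, px in enumerate(prow):
            out[top + i][left + j] = px
    return out

def build_package(face, mouth_patches, top, left):
    """One media display frame per mouth-shape patch, as in claim 11."""
    return [paste_mouth(face, p, top, left) for p in mouth_patches]
```

A real implementation would locate `(top, left)` via facial landmark detection and blend pixel values rather than overwrite them.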
  12. 根据权利要求1所述的方法，其中，所述通过显示界面对所述至少一帧目标媒体显示图像进行播放显示，包括：The method according to claim 1, wherein playing and displaying the at least one frame of the target media display image through the display interface comprises:
    通过所述显示界面，以获取得到的一显示图像为背景画面，对所述至少一帧目标媒体显示图像进行播放显示(205)。Playing and displaying, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture (205).
  13. 一种媒体显示终端,包括:A media display terminal comprising:
    采集模块(301),设置为采集目标通话语音;An acquisition module (301) configured to collect a target call voice;
    第一获取模块(302),设置为根据所述目标通话语音,获取与所述目标通话语音匹配的至少一帧目标媒体显示图像;The first obtaining module (302) is configured to acquire, according to the target call voice, at least one frame target media display image that matches the target call voice;
    显示模块(303),设置为通过显示界面对所述至少一帧目标媒体显示图像进行播放显示。The display module (303) is configured to perform play display on the at least one frame of the target media display image through the display interface.
  14. 根据权利要求13所述的媒体显示终端，其中，所述第一获取模块(302)，包括：The media display terminal according to claim 13, wherein the first obtaining module (302) comprises:
    第一确定子模块，设置为确定与所述目标通话语音对应的目标表情包；a first determining submodule, configured to determine a target emoticon package corresponding to the target call voice;
    第二确定子模块，设置为依据所述目标通话语音的声纹特征，确定发出所述目标通话语音所需的至少一个目标发音口型；a second determining submodule, configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to utter the target call voice;
    获取子模块，设置为从所述目标表情包中匹配得到包含所述目标发音口型的至少一帧目标媒体显示图像。an obtaining submodule, configured to match, from the target emoticon package, at least one frame of a target media display image containing the target pronunciation mouth shape.
  15. 根据权利要求14所述的媒体显示终端,其中,所述第二确定子模块,包括:The media display terminal according to claim 14, wherein the second determining submodule comprises:
    第一确定单元,设置为依据所述目标通话语音的声纹特征,确定与所述目标通话语音对应的文字内容;a first determining unit, configured to determine, according to the voiceprint feature of the target call voice, text content corresponding to the target call voice;
    第一匹配单元,设置为依据所述文字内容,从文字与注音的对应关系表中,匹配与所述文字内容对应的注音组合;The first matching unit is configured to match the phonetic combination corresponding to the text content from the correspondence table between the text and the phonetic according to the text content;
    第二匹配单元，设置为依据所述注音组合，从注音与发音口型的对应关系表中，匹配对应的至少一个发音口型；a second matching unit, configured to match, according to the phonetic combination, at least one corresponding pronunciation mouth shape from the correspondence table between phonetic notation and pronunciation mouth shapes;
    第二确定单元，设置为确定所述至少一个发音口型为发出所述目标通话语音所需的至少一个目标发音口型。a second determining unit, configured to determine the at least one pronunciation mouth shape as the at least one target pronunciation mouth shape required to utter the target call voice.
  16. 根据权利要求15所述的媒体显示终端，其中，所述文字与注音的对应关系表中包括汉字与拼音的对应关系表；所述第一匹配单元包括：The media display terminal according to claim 15, wherein the correspondence table between text and phonetic notation includes a correspondence table between Chinese characters and pinyin; the first matching unit comprises:
    第一匹配子单元,设置为根据所述文字内容,从汉字与拼音的对应关系表中,匹配与所述文字内容对应的拼音组合;The first matching subunit is configured to match, according to the text content, a pinyin combination corresponding to the text content from a correspondence table between Chinese characters and pinyin;
    所述注音与发音口型的对应关系表中包括拼音与发音口型的对应关系表，所述第二匹配单元包括：The correspondence table between phonetic notation and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit comprises:
    第二匹配子单元，设置为依据所述拼音组合，从拼音与发音口型的对应关系表中，匹配对应的至少一个发音口型。a second matching subunit, configured to match, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  17. 根据权利要求16所述的媒体显示终端，其中，所述拼音与发音口型的对应关系表中包括拼音中声母及韵母与发音口型的对应关系；所述第二匹配子单元是设置为：The media display terminal according to claim 16, wherein the correspondence table between pinyin and pronunciation mouth shapes includes correspondences between pinyin initials and finals and pronunciation mouth shapes; the second matching subunit is configured to:
    根据所述拼音组合，确定所述拼音组合中所包含的声母和韵母，或者，确定所述拼音组合中所包含的韵母；determine, according to the pinyin combination, the initials and finals contained in the pinyin combination, or determine the finals contained in the pinyin combination;
    依据所述声母和韵母，或者，依据所述韵母，从所述声母及韵母与发音口型的对应关系中，匹配得到相对应的至少一个发音口型。and match, according to the initials and finals, or according to the finals, the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and pronunciation mouth shapes.
  18. 根据权利要求15所述的媒体显示终端，其中，所述文字与注音的对应关系表中包括英文与英语音标的对应关系表；所述第一匹配单元包括：The media display terminal according to claim 15, wherein the correspondence table between text and phonetic notation includes a correspondence table between English text and English phonetic symbols; the first matching unit comprises:
    第三匹配子单元，设置为根据所述文字内容，从英文与英语音标的对应关系表中，匹配与所述文字内容对应的音标组合；a third matching subunit, configured to match, according to the text content, a phonetic symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols;
    所述注音与发音口型的对应关系表中包括英语音标与发音口型的对应关系表，所述第二匹配单元包括：The correspondence table between phonetic notation and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit comprises:
    第四匹配子单元，设置为依据所述音标组合，从英语音标与发音口型的对应关系表中，匹配对应的至少一个发音口型。a fourth matching subunit, configured to match, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  19. 根据权利要求18所述的媒体显示终端，其中，所述英语音标与发音口型的对应关系表中包括英语音标中元音及辅音与发音口型的对应关系；所述第四匹配子单元是设置为：The media display terminal according to claim 18, wherein the correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between vowels and consonants in the English phonetic symbols and pronunciation mouth shapes; the fourth matching subunit is configured to:
    根据所述音标组合，确定所述音标组合中所包含的元音和辅音，或者，确定所述音标组合中所包含的元音；determine, according to the phonetic symbol combination, the vowels and consonants contained in the phonetic symbol combination, or determine the vowels contained in the phonetic symbol combination;
    依据所述元音和辅音，或者，依据所述元音，从所述元音及辅音与发音口型的对应关系中，匹配得到相对应的至少一个发音口型。and match, according to the vowels and consonants, or according to the vowels, the corresponding at least one pronunciation mouth shape from the correspondences between the vowels and consonants and pronunciation mouth shapes.
  20. 根据权利要求14所述的媒体显示终端,其中,所述第一确定子模块,包括:The media display terminal according to claim 14, wherein the first determining submodule comprises:
    第三确定单元,设置为确定所述目标通话语音所对应的目标联系人;a third determining unit, configured to determine a target contact corresponding to the target call voice;
    调取单元,设置为调取与所述目标联系人预关联的目标表情包。The retrieval unit is configured to retrieve a target expression package pre-associated with the target contact.
  21. 根据权利要求13至20中任一项所述的媒体显示终端,其中,所述采集模块(301),包括:The media display terminal according to any one of claims 13 to 20, wherein the acquisition module (301) comprises:
    监听子模块,设置为监听语音通话进程;The monitoring submodule is set to monitor the voice call process;
    第三确定子模块,设置为确定接收到的对方通话语音为所述目标通话语音。The third determining submodule is configured to determine that the received counterpart voice is the target call voice.
  22. 根据权利要求13-20任一项所述的媒体显示终端,还包括:The media display terminal according to any one of claims 13 to 20, further comprising:
    第二获取模块，设置为获取素材资源包及通话联系人的个人图像，其中，所述素材资源包中包含至少一个媒体素材图像；a second acquiring module, configured to acquire a material resource package and a personal image of a call contact, wherein the material resource package includes at least one media material image;
    生成模块,设置为将所述通话联系人的个人图像与每一所述媒体素材图像进行整合,生成至少一个媒体显示图像,得到包含所述至少一个媒体显示图像的表情包。And a generating module, configured to integrate the personal image of the call contact with each of the media material images to generate at least one media display image, to obtain an emoticon package including the at least one media display image.
  23. 根据权利要求22所述的媒体显示终端，其中，所述个人图像中包括个人的脸部图像，所述媒体素材图像中包括拼音中声母及韵母所对应的发音口型图像或英语音标中元音及辅音所对应的发音口型图像，所述生成模块，包括：The media display terminal according to claim 22, wherein the personal image includes a facial image of a person, the media material images include pronunciation mouth shape images corresponding to the initials and finals in pinyin or to the vowels and consonants in the English phonetic symbols, and the generating module comprises:
    识别子模块，设置为识别所述脸部图像中的嘴部区域；an identifying submodule, configured to identify a mouth region in the facial image;
    替换模块，设置为将所述媒体素材图像中的发音口型图像在所述嘴部区域进行填充替换；a replacing module, configured to fill and replace the mouth region with the pronunciation mouth shape images in the media material images;
    生成子模块，设置为生成与所述媒体素材图像中的每一所述发音口型图像对应的媒体显示图像，得到包含与每一所述发音口型图像对应的媒体显示图像的表情包。a generating submodule, configured to generate a media display image corresponding to each of the pronunciation mouth shape images in the media material images, and obtain an emoticon package including the media display image corresponding to each of the pronunciation mouth shape images.
  24. 根据权利要求13所述的媒体显示终端,其中,所述显示模块,包括:The media display terminal according to claim 13, wherein the display module comprises:
    显示子模块，设置为通过所述显示界面，以获取得到的一显示图像为背景画面，对所述至少一帧目标媒体显示图像进行播放显示。a display submodule, configured to play and display, through the display interface, the at least one frame of the target media display image with an obtained display image as the background picture.
PCT/CN2017/114843 2016-12-14 2017-12-06 Medium displaying method and terminal WO2018108013A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611154485.5 2016-12-14
CN201611154485.5A CN108234735A (en) 2016-12-14 2016-12-14 A kind of media display methods and terminal

Publications (1)

Publication Number Publication Date
WO2018108013A1 true WO2018108013A1 (en) 2018-06-21

Family

ID=62557913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/114843 WO2018108013A1 (en) 2016-12-14 2017-12-06 Medium displaying method and terminal

Country Status (2)

Country Link
CN (1) CN108234735A (en)
WO (1) WO2018108013A1 (en)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377540B (en) * 2018-09-30 2023-12-19 网易(杭州)网络有限公司 Method and device for synthesizing facial animation, storage medium, processor and terminal
CN110062116A (en) * 2019-04-29 2019-07-26 上海掌门科技有限公司 Method and apparatus for handling information
CN110336733B (en) * 2019-04-30 2022-05-17 上海连尚网络科技有限公司 Method and equipment for presenting emoticon
CN110784762B (en) * 2019-08-21 2022-06-21 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and storage medium
CN110446066B (en) * 2019-08-28 2021-11-19 北京百度网讯科技有限公司 Method and apparatus for generating video
CN111063339A (en) * 2019-11-11 2020-04-24 珠海格力电器股份有限公司 Intelligent interaction method, device, equipment and computer readable medium
CN112804440B (en) * 2019-11-13 2022-06-24 北京小米移动软件有限公司 Method, device and medium for processing image
CN111596841B (en) * 2020-04-28 2021-09-07 维沃移动通信有限公司 Image display method and electronic equipment
CN111741162B (en) * 2020-06-01 2021-08-20 广东小天才科技有限公司 Recitation prompting method, electronic equipment and computer readable storage medium
EP3993410A1 (en) * 2020-10-28 2022-05-04 Ningbo Geely Automobile Research & Development Co., Ltd. A camera system and method for generating an eye contact image view of a person
CN114827648B (en) * 2022-04-19 2024-03-22 咪咕文化科技有限公司 Method, device, equipment and medium for generating dynamic expression package

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539354B1 (en) * 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
CN101482975A (en) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and apparatus for converting words into animation
CN101968893A (en) * 2009-07-28 2011-02-09 上海冰动信息技术有限公司 Game sound-lip synchronization system
CN104238991A (en) * 2013-06-21 2014-12-24 腾讯科技(深圳)有限公司 Voice input matching method and voice input matching device
CN104239394A (en) * 2013-06-18 2014-12-24 三星电子株式会社 Translation system comprising display apparatus and server and control method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104468959A (en) * 2013-09-25 2015-03-25 中兴通讯股份有限公司 Method, device and mobile terminal displaying image in communication process of mobile terminal

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021083125A1 (en) * 2019-10-31 2021-05-06 Oppo广东移动通信有限公司 Call control method and related product
CN112770063A (en) * 2020-12-22 2021-05-07 北京奇艺世纪科技有限公司 Image generation method and device
CN112770063B (en) * 2020-12-22 2023-07-21 北京奇艺世纪科技有限公司 Image generation method and device

Also Published As

Publication number Publication date
CN108234735A (en) 2018-06-29

Similar Documents

Publication Publication Date Title
WO2018108013A1 (en) Medium displaying method and terminal
CN110941954B (en) Text broadcasting method and device, electronic equipment and storage medium
CN110288077B (en) Method and related device for synthesizing speaking expression based on artificial intelligence
JP6019108B2 (en) Video generation based on text
CN112188304B (en) Video generation method, device, terminal and storage medium
US20190222806A1 (en) Communication system and method
CN111106995B (en) Message display method, device, terminal and computer readable storage medium
JP2014519082A5 (en)
CN111294463B (en) Intelligent response method and system
CA2677051A1 (en) A communication network and devices for text to speech and text to facial animation conversion
CN107291704A (en) Treating method and apparatus, the device for processing
CN112509609B (en) Audio processing method and device, electronic equipment and storage medium
KR20150017662A (en) Method, apparatus and storing medium for text to speech conversion
CN110990534A (en) Data processing method and device and data processing device
CN111199160A (en) Instant call voice translation method and device and terminal
CN113724683A (en) Audio generation method, computer device, and computer-readable storage medium
CN110830845A (en) Video generation method and device and terminal equipment
CN110298150B (en) Identity verification method and system based on voice recognition
CN111160051B (en) Data processing method, device, electronic equipment and storage medium
CN114398517A (en) Video data acquisition method and device
CN111462279B (en) Image display method, device, equipment and readable storage medium
CN112837668A (en) Voice processing method and device for processing voice
CN108174123A (en) Data processing method, apparatus and system
CN112562687B (en) Audio and video processing method and device, recording pen and storage medium
KR20180034927A (en) Communication terminal for analyzing call speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17880305

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17880305

Country of ref document: EP

Kind code of ref document: A1