WO2018108013A1 - Media display method and terminal - Google Patents

Media display method and terminal

Info

Publication number
WO2018108013A1
WO2018108013A1 (PCT/CN2017/114843)
Authority
WO
WIPO (PCT)
Prior art keywords
pronunciation
target
phonetic
image
correspondence table
Prior art date
Application number
PCT/CN2017/114843
Other languages
English (en)
Chinese (zh)
Inventor
张长帅
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2018108013A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72484User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/166Detection; Localisation; Normalisation using acquisition arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72439User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for image or video messaging
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M1/00Substation equipment, e.g. for use by subscribers
    • H04M1/72Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/725Cordless telephones
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04MTELEPHONIC COMMUNICATION
    • H04M2250/00Details of telephonic subscriber devices
    • H04M2250/74Details of telephonic subscriber devices with voice recognition means

Definitions

  • This document relates to, but is not limited to, the field of communication technologies, and in particular, to a media display method and terminal.
  • A mobile phone can access the network through a SIM card or wireless Wi-Fi to make a video call.
  • A media display method and terminal are provided in the embodiments of the present invention.
  • An embodiment of the present invention provides a media display method, including:
  • The at least one frame of the target media display image is played and displayed through the display interface.
  • Acquiring, according to the target call voice, at least one frame of a target media display image that matches the target call voice includes:
  • Determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to utter the target call voice includes:
  • determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to utter the target call voice.
  • Determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to utter the target call voice includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes;
  • determining that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to utter the target call voice.
  • The correspondence table between text and phonetic transcriptions includes a correspondence table between Chinese characters and pinyin; matching, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • The correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes; matching, according to the phonetic combination, at least one pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • The correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals in pinyin and the pronunciation mouth shapes; matching, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes includes:
  • obtaining the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and the pronunciation mouth shapes.
  • The correspondence table between text and phonetic transcriptions includes a correspondence table between English text and English phonetic symbols; matching, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • The correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes; matching, according to the phonetic combination, at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes includes:
  • matching at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • The correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants in the English phonetic symbols and the pronunciation mouth shapes; matching, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes includes:
  • obtaining the corresponding at least one mouth shape by matching.
  • Determining a target expression package corresponding to the target call voice includes:
  • Collecting the target call voice includes:
  • Before the step of collecting the target call voice, the method further includes:
  • The personal image includes a facial image of a person, and the media material image includes pronunciation mouth shape images corresponding to the initials and finals in pinyin, or pronunciation mouth shape images corresponding to the vowels and consonants in the English phonetic symbols.
  • Integrating the personal image of the call contact with each of the media material images to generate at least one media display image, and obtaining an expression package including the at least one media display image, includes:
  • Playing and displaying the at least one frame of the target media display image through the display interface includes:
  • An embodiment of the present invention further provides a media display terminal, including:
  • a collecting module configured to collect the target call voice;
  • a first acquiring module configured to acquire, according to the target call voice, at least one frame of a target media display image that matches the target call voice;
  • a display module configured to play and display the at least one frame of the target media display image through the display interface.
  • The first acquiring module includes:
  • a first determining submodule configured to determine a target expression package corresponding to the target call voice;
  • a second determining submodule configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to utter the target call voice;
  • an obtaining submodule configured to obtain, from the target expression package, at least one frame of a target media display image including the target pronunciation mouth shape.
  • The second determining submodule includes:
  • a first determining unit configured to determine, according to the voiceprint feature of the target call voice, the text content corresponding to the target call voice;
  • a first matching unit configured to match, according to the text content, the corresponding phonetic combination from a correspondence table between text and phonetic transcriptions;
  • a second matching unit configured to match, according to the phonetic combination, at least one corresponding pronunciation mouth shape from a correspondence table between phonetic transcriptions and pronunciation mouth shapes;
  • a second determining unit configured to determine that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to utter the target call voice.
  • The correspondence table between text and phonetic transcriptions includes a correspondence table between Chinese characters and pinyin;
  • the first matching unit includes:
  • a first matching subunit configured to match, according to the text content, the pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
  • The correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit includes:
  • a second matching subunit configured to match, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • The correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals in pinyin and the pronunciation mouth shapes; the second matching subunit is configured to:
  • obtain the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and the pronunciation mouth shapes.
  • The correspondence table between text and phonetic transcriptions includes a correspondence table between English text and English phonetic symbols;
  • the first matching unit includes:
  • a third matching subunit configured to match, according to the text content, the phonetic symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols.
  • The correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit includes:
  • a fourth matching subunit configured to match, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • The correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants in the English phonetic symbols and the pronunciation mouth shapes; the fourth matching subunit is configured to:
  • match the corresponding at least one pronunciation mouth shape.
  • The first determining submodule includes:
  • a third determining unit configured to determine a target contact corresponding to the target call voice;
  • a retrieval unit configured to retrieve a target expression package pre-associated with the target contact.
  • The collecting module (301) includes:
  • a monitoring submodule configured to monitor the voice call process;
  • a third determining submodule configured to determine that the received counterpart voice is the target call voice.
  • The media display terminal further includes:
  • a second acquiring module configured to acquire a material resource package and a personal image of the call contact, wherein the material resource package includes at least one media material image;
  • a generating module configured to integrate the personal image of the call contact with each of the media material images to generate at least one media display image, so as to obtain an expression package including the at least one media display image.
  • The personal image includes a facial image of the person, and the media material image includes pronunciation mouth shape images corresponding to the initials and finals in pinyin, or pronunciation mouth shape images corresponding to the vowels and consonants in the English phonetic symbols.
  • The generating module includes:
  • a replacement submodule configured to fill and replace the mouth region with the pronunciation mouth shape image in the media material image;
  • a generating submodule configured to generate a media display image corresponding to each pronunciation mouth shape image in the media material images, so as to obtain an expression package including the media display image corresponding to each pronunciation mouth shape image.
  • The display module includes:
  • a display submodule configured to use an obtained display image as a background image on the display interface and play and display the at least one frame of the target media display image.
  • Through these embodiments, virtual video telephone communication can break through the limitations of the network environment. The process does not depend on the network, saves traffic, and can even be free of traffic restrictions; a virtualized video picture accompanies the call, making the call more lively and vivid, strengthening the communication effect, adding fun to the conversation, and improving the user experience.
  • FIG. 1 is a schematic flow chart of a media display method in an embodiment of the present invention.
  • FIG. 2 is a schematic flow chart of another media display method in an embodiment of the present invention.
  • FIG. 3 is a block diagram showing the structure of a media display terminal in an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an operator template in an embodiment of the present invention.
  • In the related art, a mobile phone can access the network through a SIM card or wireless Wi-Fi to make a video call.
  • However, video calls usually rely on a wireless network, and in practice wireless network coverage is not available anytime and anywhere; connecting a video call through the telephone card is expensive.
  • This approach depends on the network environment and is limited by its constraints, so video conversation cannot become the normal mode of communication. Yet when people communicate only by voice, the exchange is not vivid or lively, and the user experience falls short.
  • A media display method is disclosed in an embodiment of the present invention.
  • Step 101: Collect a target call voice.
  • This step may occur during an ordinary telephone call or while voice is played back in instant messaging software such as WeChat. The target call voice may be the user's own voice in the call or the voice of the contact the user is talking with.
  • The target call voice becomes the object of the subsequent processing.
  • The step of collecting the target call voice may include: monitoring a voice call process, and determining that the received counterpart voice is the target call voice. That is, the call voice of the other party is collected and subjected to the subsequent processing to realize a fun call.
  • Step 102: Acquire, according to the target call voice, at least one frame of a target media display image that matches the target call voice.
  • The at least one frame of the target media display image may be selected from a set of image packages.
  • The at least one frame of the target media display image may match the content of the target call voice, or may match the tone, volume, or pronunciation mouth shape required to utter the target call voice. The target media display image may show an expression, a body motion, a symbol, or the place and context mentioned in the voice content.
  • Step 103: Play and display the at least one frame of the target media display image through the display interface.
  • The body motion may be switched along with the voice.
  • This process matches at least one frame of the target media display image and plays and displays the images continuously to form a video-style playback effect, so that recognized speech is converted into a video display. This can be achieved by a local software application on the terminal device, breaking through the limitations of the network environment to achieve virtual video telephone communication. The process does not depend on the network, saves traffic, and can even be free of traffic restrictions, so that a virtualized video picture accompanies the call; the call becomes more lively and vivid, the communication effect is strengthened, fun is added, and the user experience is improved.
  • Another media display method is also disclosed in an embodiment of the present invention.
  • Step 201: Collect a target call voice.
  • The step of collecting the target call voice may include: monitoring a voice call process, and determining that the received counterpart voice is the target call voice. That is, the call voice of the other party is collected and subjected to the subsequent processing to realize a fun call.
  • Step 202: Determine a target expression package corresponding to the target call voice.
  • The target expression package may be a fixed expression package that matches different target call voices and is determined and read directly from the storage device, or an expression package that changes with specific elements of the target call voice and therefore needs to be matched and determined according to the target call voice.
  • The step of determining a target expression package corresponding to the target call voice includes:
  • The expression package may be resource content associated with the target contact. For example, when a known contact calls the user's mobile phone, the phone can tell from the address book that the target call voice is sent by that contact and can adapt the expression package corresponding to the caller, which may be based on the target contact's photo or a specific picture. Different settings may be made for specific contacts, making the display result more targeted and more interesting.
  • Step 203: Determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to utter the target call voice.
  • Different call voices correspond to different pronunciation mouth shapes, and different voices carry different voiceprint feature information.
  • Therefore, the required pronunciation mouth shape can be determined according to the voiceprint features in the collected target call voice.
  • The step of determining, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to utter the target call voice includes:
  • A phonetic transcription describes how text is pronounced. Different languages use different scripts, and different scripts use different phonetic systems: when the text is Chinese, the corresponding phonetic transcription is pinyin; when the text is English, it is the English phonetic symbols. To determine the pronunciation mouth shape from the voiceprint feature of the target call voice, the voice is first converted into text, the text is matched to the corresponding phonetic combination, and the target pronunciation mouth shape corresponding to that phonetic combination is then matched.
  • The process may be implemented as follows: receive the voice, quantize the voice data, and call an open-source interface to convert the voice into text.
  • The principle is that different voices carry different voiceprint feature information, so the correspondence between voiceprint features and text can be recorded in advance to generate a database. After new voice is captured, it is compared with the pre-recorded database to find the corresponding text. The text-to-phonetic conversion works the same way: the correspondences between different phonetic transcriptions and their texts are recorded in advance and written into an array to generate the database correspondence. After the voice is converted into text, the phonetic transcription is found in the database, and the target pronunciation mouth shape corresponding to it is obtained from the preset correspondence table between phonetic transcriptions and pronunciation mouth shapes. The phonetic transcription is split and fitted to expression images, which are switched rapidly in the display to generate a virtual video effect.
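  • As a toy illustration of the database-correspondence idea described above, the sketch below matches an incoming voiceprint feature vector against pre-recorded reference vectors by nearest-neighbor distance. The feature values, table contents, and names (`reference_features`, `match_text`) are hypothetical placeholders, not the patent's actual recognition interface.

```python
import numpy as np

# Hypothetical pre-recorded database: voiceprint feature vectors paired with
# text. In the described method this table is built in advance; the numbers
# here stand in for whatever features the recognizer actually extracts.
reference_features = np.array([
    [0.12, 0.80, 0.33],   # features recorded for the utterance "ni hao"
    [0.90, 0.10, 0.55],   # features recorded for the utterance "zai jian"
])
reference_texts = ["你好", "再见"]

def match_text(feature_vector):
    """Return the pre-recorded text whose voiceprint features are closest."""
    distances = np.linalg.norm(reference_features - feature_vector, axis=1)
    return reference_texts[int(np.argmin(distances))]

print(match_text(np.array([0.11, 0.82, 0.30])))  # -> 你好
```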
  • The correspondence table between text and phonetic transcriptions includes a correspondence table between Chinese characters and pinyin;
  • matching the phonetic combination corresponding to the text content includes:
  • matching the pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
  • The correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one pronunciation mouth shape from that table includes: matching, according to the pinyin combination, the corresponding at least one pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • The process of converting Chinese characters into pinyin works in the same way.
  • The relationship between pinyin and Chinese characters is recorded in advance by reading the Chinese pinyin table in the standard GBK character set database and writing it into an array to generate the database correspondence; after the speech is converted into Chinese characters, the characters are looked up in the database.
  • The target pronunciation mouth shape corresponding to the pinyin is then obtained, the pinyin is split and fitted to the expression images, and the display is switched rapidly to generate a virtual video effect.
  • The correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals in pinyin and the pronunciation mouth shapes; matching, according to the pinyin combination, from the correspondence table between pinyin and pronunciation mouth shapes obtains the corresponding at least one pronunciation mouth shape.
  • The pinyin is positioned and split in the array structure to separate the initials and finals. Since some Chinese pronunciations correspond to a final alone, the resulting pinyin combination may contain both an initial and a final, or only a final. The expression images are adapted according to the mouth shapes of the initials and finals and switched in the display window; the rapid switching of expressions generates a virtual video presentation, as sketched below.
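  • A minimal sketch of that initial/final split, assuming a longest-prefix match against the standard list of pinyin initials; syllables such as "an" that have no initial fall through to the finals-only case. All names are illustrative.

```python
# Standard pinyin initials, longest first so "zh"/"ch"/"sh" match before "z"/"c"/"s".
# "y" and "w" are treated as initials here purely for lookup purposes.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable):
    """Split a pinyin syllable into (initial, final); the initial may be empty."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # finals-only syllables such as "an", "e", "ao"

print(split_pinyin("hao"))  # ('h', 'ao')
print(split_pinyin("an"))   # ('', 'an')
```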
  • The correspondence table between pinyin and pronunciation mouth shapes may also be generated by import or download.
  • The storage device is provided with a mouth shape resource library for the pinyin alphabet. Each mouth shape image in the library corresponds to the pronunciation example of one pinyin letter, and the mouth shapes correspond one-to-one with the initials and finals in the pinyin table, so that every pronunciation can find its corresponding initial and final images in the mouth shape library. In this way, the pronunciation mouth shape genuinely matches the appearance of the speaker's mouth.
  • The correspondence table between text and phonetic transcriptions includes a correspondence table between English text and English phonetic symbols, and matching the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions includes:
  • matching the phonetic symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols.
  • The correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and matching, according to the phonetic combination, at least one pronunciation mouth shape from that table includes: matching, according to the phonetic symbol combination, at least one pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • The correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants in the English phonetic symbols and the pronunciation mouth shapes; matching at least one pronunciation mouth shape according to the phonetic symbol combination includes:
  • matching the text content in the correspondence table between English text and English phonetic symbols.
  • The English phonetic symbols are positioned and split in the array structure to separate vowels and consonants. Since some English pronunciations correspond to a vowel alone, the resulting phonetic symbol combination may contain both vowels and consonants, or only a vowel. The expression images are adapted according to the mouth shapes corresponding to the vowels and consonants and switched in the display window; the rapid switching of expressions generates a virtual video presentation.
  • The correspondence table between English phonetic symbols and pronunciation mouth shapes may also be generated by import or download.
  • The solution is based on a mobile phone, taking virtual video as an example.
  • The storage device is provided with a mouth shape resource library for the English phonetic symbols. Each mouth shape image in the library corresponds to the pronunciation example of one English phonetic symbol, and the mouth shapes correspond one-to-one with the vowels and consonants in the English phonetic table, so that every pronunciation can find its corresponding vowel and consonant pronunciation images in the mouth shape library and combine them. In this way, the pronunciation mouth shape genuinely matches the actual mouth of the caller.
  • Step 204: Match, from the target expression package, at least one frame of a target media display image including the target pronunciation mouth shape.
  • The target expression package contains different mouth shapes, and the media display images included in it are display images with differently shaped mouths. Optionally, a target media display image may also contain other changing elements that match the mouth shape, such as eyebrow movements, expressions, and face shapes.
  • Step 205: Play and display the at least one frame of the target media display image through the display interface.
  • The user terminal (for example, a mobile phone) adapts the associated expressions according to the speech recognition result and synthesizes and optimizes the corresponding image resources, so that the pictures play continuously along with the voice and the expressions are continuously switched and updated.
  • A virtual video scene is thus generated in a network-free environment, realizing a virtually displayed video dialogue and making the conversation more effective and interesting.
  • Step 1: Read the Chinese pinyin content in the standard Chinese character encoding character set (GBK) database in advance, and write the pinyin data into an array. Each element of the array is a structure with four parts: the pinyin, the initial obtained after splitting the pinyin, the final, and the Chinese character corresponding to the pinyin.
  • Step 2: Convert the voice into text through the open-source interface, search the array using the text as the key to find the corresponding pinyin, and then split the pinyin into its initial and final.
  • Step 3: Look up the expressions associated with the initial and final, create a visual dialog on the desktop as a user interface (UI) window, and present the expressions in the UI window. A compact sketch of these steps follows.
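  • In this sketch of steps 1 to 3, each array element mirrors the described four-part structure (pinyin, initial, final, Chinese character), the recognized text is used as the key, and the split initial and final select the mouth-shape images to present. The table entries and image file names are illustrative, not the actual GBK data or resource paths.

```python
# Step 1: each element mirrors the described structure:
# (pinyin, initial, final, Chinese character). Entries here are illustrative.
PINYIN_TABLE = [
    ("ni",  "n", "i",  "你"),
    ("hao", "h", "ao", "好"),
]
BY_CHAR = {char: (initial, final) for _, initial, final, char in PINYIN_TABLE}

def mouth_shapes_for_text(text):
    """Steps 2-3: look up each character, split into initial/final, and
    return the mouth-shape images to present in the UI window."""
    frames = []
    for char in text:
        initial, final = BY_CHAR[char]
        if initial:                       # some syllables are finals-only
            frames.append(f"mouth_{initial}.png")
        frames.append(f"mouth_{final}.png")
    return frames

print(mouth_shapes_for_text("你好"))
# ['mouth_n.png', 'mouth_i.png', 'mouth_h.png', 'mouth_ao.png']
```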
  • The duration of the target call voice may be calculated first, and each frame of the target media display image is then played continuously according to that duration: the duration of the target call voice is divided by the number of frames of the target media display image to obtain the playing time of one frame. Taking "hello" as an example, if the voice lasts 1 second and is decomposed into four expression images, each image is displayed for 1/4 second, that is, 250 milliseconds. The at least one frame of the target media display image acquired according to the target call voice is thus switched and presented continuously.
  • The expression presented in the window corresponds to the received voice.
  • The user's expression package is switched rapidly in the UI window, presenting a video effect and thereby realizing the virtual video call function.
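  • The timing rule above is simply the voice duration divided by the frame count; a sketch using the document's own example of a 1-second voice decomposed into four expression images:

```python
def frame_duration_ms(voice_duration_ms, num_frames):
    """Divide the length of the target call voice by the number of
    target media display images to get each frame's display time."""
    return voice_duration_ms / num_frames

# A 1-second "hello" decomposed into four mouth-shape images.
print(frame_duration_ms(1000, 4))  # 250.0 milliseconds per image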
  • The step of playing and displaying the at least one frame of the target media display image through the display interface includes: using an obtained display image as a background image on the display interface and playing and displaying the at least one frame of the target media display image.
  • When playing and displaying the target media display image, a playback background may be added. The background may be a fixed background corresponding to the set contact, or a corresponding image may be matched and acquired as the background according to keywords or words recognized in the target call voice, so that the background image changes with the user's voice content.
  • This process mainly performs voice matching and synthesizes known scene images (background images) from the library without a wireless network environment, thereby realizing virtual scene reproduction in a network-free environment.
  • The continuously switched scenes and expressions realize a virtually displayed video dialogue, making the conversation more effective and interesting.
  • The core implementation process is as follows. When a voice call is made on a mobile phone, the phones of both parties have expressions and scene images stored in advance. On receiving the other party's voice, the local terminal activates the speech recognition module, switches to the corresponding media display resources according to the recognition result, and shows them in the terminal's dialogue interface. As the voice continues, speech recognition keeps adapting the corresponding expression resources; the terminal interface may display a fixed background image or adapt a background image to the voice content. The expressions, matched with the background image in the terminal interface and switched quickly, produce the visual effect of a real video conversation, thereby realizing a virtual-reality video dialogue.
  • Before the step of collecting the target call voice, the method further includes:
  • The personal image may be a personal expression image or an image associated with the call contact.
  • This process corresponds to initializing the expression package containing the media display images, so that at least one frame of the target media display image can later be acquired according to the target call voice.
  • The material resource package can be downloaded in advance when a network is available.
  • The process of integrating the personal image with each media material image may be implemented by, for example, image filling, partial replacement, or partial coverage.
  • The user installs the virtual simulation software on the mobile phone; the software can set image scenes arbitrarily, upload the user's image, initialize the presets, and generate the user's expression package.
  • The user's mobile phone stores the required image resources, which may be captured by the camera in advance or downloaded from the network to local storage, and usually include the user's personal image, the material resource package, and scene pictures for regular video conversation.
  • The material resource package is, for example, a mouth shape resource package, which is integrated by the software of the embodiment of the present invention and provided for the user's use; taking the mouth shape resource package as an example, the software itself integrates the mouth shape image resources.
  • The user provides a personal image before use and initializes the software; the mouth shape images and the personal image are initialized and image-optimized to synthesize the mouth shape expression package.
  • An expression package corresponding to the user's pinyin alphabet is generated.
  • The image synthesis first clears the lip and edge regions of the user's facial image, superimposes mouth shape resources of the same size, and then optimizes the image to obtain a user-defined expression package corresponding to the pinyin alphabet.
  • The personal image includes a facial image of the person, and the media material image includes pronunciation mouth shape images corresponding to the initials and finals in pinyin, or pronunciation mouth shape images corresponding to the vowels and consonants in the English phonetic symbols.
  • The step of integrating the personal image of the call contact with each of the media material images to generate at least one media display image and obtaining an expression package including the at least one media display image includes: generating the media display image corresponding to each pronunciation mouth shape image, so as to obtain an expression package including the media display image corresponding to each pronunciation mouth shape image.
  • The technical implementation steps of synthesizing the face-and-mouth expression package are as follows:
  • Step 1: Taking a call between two parties as an example, the voice communication device held by A stores the image resource of B in advance, together with the mouth shape image resources corresponding to the pronunciation of all letters of the Chinese pinyin alphabet.
  • Step 2: Convert the color image of the face and mouth into a grayscale image.
  • The conversion is Gray = R*0.299 + G*0.587 + B*0.114. To avoid slow floating-point operations, an integer algorithm with rounding is introduced, giving the equivalent variant Gray = (R*30 + G*59 + B*11 + 50) / 100, which improves the conversion efficiency.
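  • A sketch of both variants of the conversion; the +50 term rounds the result before the integer division by 100:

```python
def gray_float(r, g, b):
    """Floating-point luminance conversion."""
    return r * 0.299 + g * 0.587 + b * 0.114

def gray_int(r, g, b):
    """Integer variant: +50 rounds the result before dividing by 100."""
    return (r * 30 + g * 59 + b * 11 + 50) // 100

print(gray_float(200, 120, 40))  # ~134.8
print(gray_int(200, 120, 40))    # 135
```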
  • Step 3: Take a grayscale threshold on the grayscale image and use it to segment the face, then detect the mouth region: perform edge detection on the facial image and mean-filter the grayscale image using a mean operator template (i.e., each pixel value is recomputed from its neighboring pixels). This can detect feature areas of the face, such as the eye region, or the mouth region can be identified from the symmetry and structural distribution of the face.
  • Step 4: Replace the lip area detected in the face region with the original mouth shape resource to generate an expression. The pixel values of the mouth shape image from step 1 are sampled and quantized so that its pixel count equals that of the lip area detected in step 3, and the sampled and quantized mouth shape resource is then filled into the lip area of step 3 to reconstruct the generated facial expression image.
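  • A hedged sketch of step 4 using OpenCV-style operations: the mouth shape resource is resampled to the size of the detected lip area and filled into the face image. The detection itself is only assumed here as a bounding rectangle, since the document describes it at the level of principle.

```python
import cv2

def replace_mouth_region(face_img, lip_img, mouth_rect):
    """Step 4 sketch: resample the mouth shape resource to the detected
    lip area's pixel count and fill it into the face image.
    mouth_rect = (x, y, w, h) is assumed to come from the threshold /
    mean-operator detection of step 3."""
    x, y, w, h = mouth_rect
    resized_lip = cv2.resize(lip_img, (w, h))  # sample/quantize to region size
    out = face_img.copy()
    out[y:y + h, x:x + w] = resized_lip        # fill and replace the lip area
    return out
```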
  • Step 5: Perform Gaussian filtering on the newly generated expression image to enhance the smoothness of the synthesized image, and generate an expression resource library.
  • Each expression image in the library is the face with the pronunciation mouth shape corresponding to one letter of the pinyin alphabet.
  • The face is combined with the different mouth shapes to generate different expression images, and each image corresponds to the pronunciation of a pinyin letter.
  • The newly generated facial expression image is denoised and smoothed by Gaussian filtering to obtain a clear expression image.
  • The template calculated from the Gaussian function consists of floating-point numbers.
  • To speed this up, an integer 5×5 template operator with coefficient 1/273 is used, as shown in FIG. 4.
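  • The integer 5×5 operator with coefficient 1/273 is the standard integer approximation of a Gaussian kernel (its entries sum to 273); the kernel below is assumed to match the operator template of FIG. 4. A sketch of applying it:

```python
import numpy as np
from scipy.ndimage import convolve

# Standard integer 5x5 Gaussian approximation; the entries sum to 273,
# hence the 1/273 normalization coefficient.
GAUSS_5X5 = np.array([
    [1,  4,  7,  4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1,  4,  7,  4, 1],
]) / 273.0

def smooth(image):
    """Gaussian-filter the synthesized expression image to denoise and smooth it."""
    return convolve(image.astype(float), GAUSS_5X5, mode="nearest")
```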
  • In this process, at least one frame of the target media display image is matched according to the collected call voice, and the target media display images are played and displayed continuously to form a video-style playback effect, converting recognized speech into a video display. This can be achieved by a local software application on the terminal device, breaking through the limitations of the network environment to achieve virtual video telephone communication. The process does not depend on the network, saves traffic, and can even be free of traffic restrictions, so that a virtualized video picture accompanies the call; the call becomes more vivid and lively, the communication effect is enhanced, and fun is added to the conversation.
  • An embodiment of the present invention further discloses a media display terminal, shown in FIG. 3, which includes a collecting module 301, a first acquiring module 302, and a display module 303.
  • The media display terminal may be a voice-capable terminal such as a smart watch or a mobile phone.
  • The collecting module 301 is configured to collect a target call voice.
  • The first acquiring module 302 is configured to acquire, according to the target call voice, at least one frame of a target media display image that matches the target call voice.
  • The display module 303 is configured to play and display the at least one frame of the target media display image through the display interface.
  • The first acquiring module includes:
  • a first determining submodule configured to determine a target expression package corresponding to the target call voice;
  • a second determining submodule configured to determine, according to the voiceprint feature of the target call voice, at least one target pronunciation mouth shape required to utter the target call voice;
  • an obtaining submodule configured to obtain, from the target expression package, at least one frame of a target media display image including the target pronunciation mouth shape.
  • The second determining submodule includes:
  • a first determining unit configured to determine, according to the voiceprint feature of the target call voice, the text content corresponding to the target call voice;
  • a first matching unit configured to match, according to the text content, the phonetic combination corresponding to the text content from the correspondence table between text and phonetic transcriptions;
  • a second matching unit configured to match, according to the phonetic combination, at least one corresponding pronunciation mouth shape from the correspondence table between phonetic transcriptions and pronunciation mouth shapes;
  • a second determining unit configured to determine that the at least one pronunciation mouth shape is the at least one target pronunciation mouth shape required to utter the target call voice.
  • The correspondence table between text and phonetic transcriptions includes a correspondence table between Chinese characters and pinyin; the first matching unit includes:
  • a first matching subunit configured to match, according to the text content, the pinyin combination corresponding to the text content from the correspondence table between Chinese characters and pinyin.
  • The correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between pinyin and pronunciation mouth shapes, and the second matching unit includes:
  • a second matching subunit configured to match, according to the pinyin combination, at least one corresponding pronunciation mouth shape from the correspondence table between pinyin and pronunciation mouth shapes.
  • The correspondence table between pinyin and pronunciation mouth shapes includes correspondences between the initials and finals in pinyin and the pronunciation mouth shapes; the second matching subunit is configured to:
  • obtain the corresponding at least one pronunciation mouth shape from the correspondences between the initials and finals and the pronunciation mouth shapes.
  • The correspondence table between text and phonetic transcriptions includes a correspondence table between English text and English phonetic symbols; the first matching unit includes:
  • a third matching subunit configured to match, according to the text content, the phonetic symbol combination corresponding to the text content from the correspondence table between English text and English phonetic symbols.
  • The correspondence table between phonetic transcriptions and pronunciation mouth shapes includes a correspondence table between English phonetic symbols and pronunciation mouth shapes, and the second matching unit includes:
  • a fourth matching subunit configured to match, according to the phonetic symbol combination, at least one corresponding pronunciation mouth shape from the correspondence table between English phonetic symbols and pronunciation mouth shapes.
  • The correspondence table between English phonetic symbols and pronunciation mouth shapes includes correspondences between the vowels and consonants in the English phonetic symbols and the pronunciation mouth shapes; the fourth matching subunit is configured to match the corresponding at least one pronunciation mouth shape.
  • The first determining submodule includes:
  • a third determining unit configured to determine a target contact corresponding to the target call voice;
  • a retrieval unit configured to retrieve a target expression package pre-associated with the target contact.
  • The collecting module includes:
  • a monitoring submodule configured to monitor the voice call process;
  • a third determining submodule configured to determine that the received counterpart voice is the target call voice.
  • The terminal further includes:
  • a second acquiring module configured to acquire a material resource package and a personal image of the call contact, wherein the material resource package includes at least one media material image;
  • a generating module configured to integrate the personal image of the call contact with each of the media material images to generate at least one media display image, so as to obtain an expression package including the at least one media display image.
  • The personal image includes a facial image of a person, and the media material image includes pronunciation mouth shape images corresponding to the initials and finals in pinyin, or pronunciation mouth shape images corresponding to the vowels and consonants in the English phonetic symbols.
  • The generating module includes:
  • an identification submodule configured to identify the mouth region in the facial image;
  • a replacement submodule configured to fill and replace the mouth region with the pronunciation mouth shape image in the media material image;
  • a generating submodule configured to generate a media display image corresponding to each pronunciation mouth shape image in the media material images, so as to obtain an expression package including the media display image corresponding to each pronunciation mouth shape image.
  • The display module includes:
  • a display submodule configured to use an obtained display image as a background image on the display interface and play and display the at least one frame of the target media display image.
  • The media display terminal matches at least one frame of the target media display image according to the collected call voice and plays and displays the target media display images continuously to form a video-style playback effect, converting recognized speech into a video display.
  • It thereby breaks through the limitations of the network environment to achieve virtual video telephone communication: the process does not depend on the network, saves traffic, and can even be free of traffic restrictions, so that a virtualized video picture accompanies the call; the call becomes more vivid and lively, communication is enhanced, and fun is added.
  • An embodiment of the invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method described in the foregoing embodiments.
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery medium.
  • Relational terms such as first and second are used merely to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations.
  • The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device.
  • Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises that element.
  • The embodiments of the present invention break through the limitations of the network environment to achieve virtual video telephone communication. The process does not depend on the network, saves traffic, and can even be free of traffic restrictions, so that a virtualized video picture accompanies the call; the call becomes more vivid and lively, the communication effect is strengthened, fun is added, and the user experience is improved.

Abstract

Disclosed are a media display method and terminal. The method comprises: collecting a target call voice (101); acquiring, according to the target call voice, at least one frame of a target media display image matching the target call voice (102); and playing and displaying, through a display interface, the at least one frame of the target media display image (103).
PCT/CN2017/114843 2016-12-14 2017-12-06 Media display method and terminal WO2018108013A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611154485.5 2016-12-14
CN201611154485.5A CN108234735A (zh) 2016-12-14 2018-06-29 Media display method and terminal

Publications (1)

Publication Number Publication Date
WO2018108013A1 (fr) 2018-06-21

Family

ID=62557913

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/114843 WO2018108013A1 (fr) 2016-12-14 2017-12-06 Media display method and terminal

Country Status (2)

Country Link
CN (1) CN108234735A (fr)
WO (1) WO2018108013A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021083125A1 (fr) * 2019-10-31 2021-05-06 Oppo广东移动通信有限公司 Call control method and related product
CN112770063A (zh) * 2020-12-22 2021-05-07 北京奇艺世纪科技有限公司 Image generation method and device

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377540B (zh) * 2018-09-30 2023-12-19 网易(杭州)网络有限公司 Facial animation synthesis method and apparatus, storage medium, processor, and terminal
CN110062116A (zh) * 2019-04-29 2019-07-26 上海掌门科技有限公司 Method and device for processing information
CN110336733B (zh) * 2019-04-30 2022-05-17 上海连尚网络科技有限公司 Method and device for presenting expression packages
CN110784762B (zh) * 2019-08-21 2022-06-21 腾讯科技(深圳)有限公司 Video data processing method, apparatus, device, and storage medium
CN110446066B (zh) * 2019-08-28 2021-11-19 北京百度网讯科技有限公司 Method and apparatus for generating video
CN111063339A (zh) * 2019-11-11 2020-04-24 珠海格力电器股份有限公司 Intelligent interaction method, apparatus, device, and computer-readable medium
CN112804440B (zh) * 2019-11-13 2022-06-24 北京小米移动软件有限公司 Method, device, and medium for processing images
CN111596841B (зh) * 2020-04-28 2021-09-07 维沃移动通信有限公司 Image display method and electronic device
CN111741162B (zh) * 2020-06-01 2021-08-20 广东小天才科技有限公司 Recitation prompting method, electronic device, and computer-readable storage medium
EP3993410A1 (fr) * 2020-10-28 2022-05-04 Ningbo Geely Automobile Research & Development Co., Ltd. Camera system and method of generating an eye contact image view of a person
CN114827648B (zh) * 2022-04-19 2024-03-22 咪咕文化科技有限公司 Method, apparatus, device, and medium for generating dynamic expression packages

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539354B1 (en) * 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
CN101482975A (zh) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and device for converting text into animation
CN101968893A (zh) * 2009-07-28 2011-02-09 上海冰动信息技术有限公司 Game voice and lip synchronization system
CN104238991A (зh) * 2013-06-21 2014-12-24 腾讯科技(深圳)有限公司 Voice input matching method and device
CN104239394A (zh) * 2013-06-18 2014-12-24 三星电子株式会社 Translation system including display device and server, and control method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104468959A (zh) * 2013-09-25 2015-03-25 中兴通讯股份有限公司 Method and device for displaying images during a call on a mobile terminal, and mobile terminal

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6539354B1 (en) * 2000-03-24 2003-03-25 Fluent Speech Technologies, Inc. Methods and devices for producing and using synthetic visual speech based on natural coarticulation
CN101482975A (zh) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and device for converting text into animation
CN101968893A (zh) * 2009-07-28 2011-02-09 上海冰动信息技术有限公司 Game voice and lip synchronization system
CN104239394A (zh) * 2013-06-18 2014-12-24 三星电子株式会社 Translation system including display device and server, and control method thereof
CN104238991A (zh) * 2013-06-21 2014-12-24 腾讯科技(深圳)有限公司 Voice input matching method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021083125A1 (fr) * 2019-10-31 2021-05-06 Oppo广东移动通信有限公司 Call control method and related product
CN112770063A (zh) * 2020-12-22 2021-05-07 北京奇艺世纪科技有限公司 Image generation method and device
CN112770063B (zh) * 2020-12-22 2023-07-21 北京奇艺世纪科技有限公司 Image generation method and device

Also Published As

Publication number Publication date
CN108234735A (zh) 2018-06-29

Similar Documents

Publication Publication Date Title
WO2018108013A1 (fr) Media display method and terminal
CN110941954B (zh) Text broadcasting method and apparatus, electronic device, and storage medium
CN110288077B (zh) Artificial-intelligence-based method for synthesizing speaking expressions, and related apparatus
JP6019108B2 (ja) Character-based video generation
CN110808034A (zh) Voice conversion method and apparatus, storage medium, and electronic device
US20190222806A1 (en) Communication system and method
CN111106995B (zh) Message display method, apparatus, terminal, and computer-readable storage medium
CN109859298B (zh) Image processing method and apparatus, device, and storage medium
JP2014519082A5 (fr)
CN111294463B (зh) Intelligent response method and system
CA2677051A1 (fr) Communication network and devices for text-to-speech and text-to-facial-animation conversion
CN112188304A (zh) Video generation method and apparatus, terminal, and storage medium
CN107291704A (зh) Processing method and apparatus, and apparatus for processing
CN112509609B (зh) Audio processing method and apparatus, electronic device, and storage medium
KR20150017662A (ko) Text-to-speech conversion method, apparatus, and storage medium
CN110990534A (зh) Data processing method and apparatus, and apparatus for data processing
CN111199160A (зh) Translation method, apparatus, and terminal for instant call voice
CN113724683A (зh) Audio generation method, computer device, and computer-readable storage medium
CN110830845A (зh) Video generation method and apparatus, and terminal device
CN110298150B (зh) Identity authentication method and system based on speech recognition
CN111160051B (зh) Data processing method and apparatus, electronic device, and storage medium
CN111462279B (зh) Image display method, apparatus, device, and readable storage medium
CN112837668A (зh) Voice processing method and apparatus, and apparatus for processing voice
CN108174123A (зh) Data processing method, apparatus, and system
CN112562687B (зh) Audio and video processing method and apparatus, voice recorder, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17880305

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17880305

Country of ref document: EP

Kind code of ref document: A1