WO2021080190A1 - Method and device for providing a voice service - Google Patents

Method and device for providing a voice service

Info

Publication number
WO2021080190A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
electronic device
image
information
stored
Prior art date
Application number
PCT/KR2020/012559
Other languages
English (en)
Korean (ko)
Inventor
부영종
최송아
Original Assignee
삼성전자 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 삼성전자 주식회사
Publication of WO2021080190A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G10L15/00 - Speech recognition
    • G10L15/005 - Language recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/08 - Speech classification or search
    • G10L15/14 - Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/24 - Speech recognition using non-acoustical features
    • G10L15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L15/26 - Speech to text systems

Definitions

  • Embodiments of the present disclosure relate to a method of providing a voice service and, more particularly, to a method, an apparatus, and a server for providing a voice service to a user by extracting a speaker's voice from an image.
  • Electronic devices can support various input methods for inputting voice through a microphone.
  • Electronic devices can use speech synthesis technologies that artificially create speech by dividing the voice input through a microphone into certain speech units, labeling the units and feeding them to a synthesizer, and then recombining only the necessary speech units according to instructions.
  • In addition, text-to-speech (TTS) conversion can be used.
  • In addition, technologies that perform biometric authentication using voice are being studied.
  • An objective of the present disclosure is to provide a method, a device, and a system for providing a voice service by extracting a speaker's voice from an image.
  • According to the disclosed embodiments, the electronic device may provide a voice service using not only voices stored in a database or the user's own voice, but also other voices, such as a voice extracted from an image.
  • FIG. 1 is a schematic diagram illustrating a method of providing a voice service by extracting a voice from an image reproduced on an electronic device, according to an exemplary embodiment.
  • FIG. 2 is a flowchart illustrating a method of providing a voice service by extracting a voice of an object from an image, according to an exemplary embodiment.
  • FIG. 3A is a flowchart illustrating a method of providing a voice service by acquiring an image from an external device, according to an exemplary embodiment.
  • FIG. 3B is an exemplary diagram of a method of providing a voice service by acquiring an image from an external device, according to an exemplary embodiment.
  • FIG. 3C is an exemplary diagram of a method of providing a voice service in which a display device performs voice recognition, according to an exemplary embodiment.
  • FIGS. 4A and 4B are exemplary diagrams of a method of providing a voice service by searching a server for an image of an object, according to an exemplary embodiment.
  • FIG. 5 is an exemplary diagram of a method of receiving, from a server, voice data extracted by an external device, according to an exemplary embodiment.
  • FIG. 6 is a flowchart illustrating a method of generating a text-to-speech (TTS) voice by modeling screens and voices from an image, according to an exemplary embodiment.
  • FIG. 7 is an exemplary diagram of a method of detecting an utterance state of an object in consideration of additional information of an image, according to an exemplary embodiment.
  • FIG. 8 is an exemplary diagram of a method of extracting and verifying a voice in order to provide a voice service, according to an embodiment.
  • FIG. 9 is an exemplary diagram of a method for a user to store a voice of a specific object from an image including a plurality of people, according to an exemplary embodiment.
  • FIG. 10 is a block diagram illustrating a configuration of an electronic device according to an exemplary embodiment.
  • FIG. 11 is a block diagram illustrating a detailed configuration of an electronic device according to an exemplary embodiment.
  • A method of operating an electronic device according to an embodiment may include obtaining information on an object whose voice is to be stored; extracting the voice of the object by detecting an utterance state of the object from an image including the object, based on the obtained information on the object; storing, as a result of modeling the extracted voice, the voice of the object and voice information related to the voice of the object; and providing a voice service based on the stored voice information.
  • According to an embodiment, the method of operating the electronic device may further include obtaining an image including the object, based on the acquired object information, from at least one of an image being played on the device, an image being played on an external device communicating with the device, and an image stored in the external device.
  • According to an embodiment, the method of operating the electronic device may further include obtaining an image including a plurality of people, and the extracting of the voice of the object may include recognizing the object in the acquired image based on the obtained information on the object, and extracting the voice of the object by detecting the utterance state of the recognized object in the acquired image.
  • According to an embodiment, the providing of the voice service may include providing a voice service that outputs an output text, determined based on the stored voice information, in the voice of the object.
  • According to an embodiment, the voice information related to the voice of the object may include at least one of a dialect, intonation, buzzwords, tone, pronunciation, parodies, and dialogue associated with the voice of the object.
  • According to an embodiment, the method of operating the electronic device may further include receiving an image including the object from a server, based on the obtained information on the object, and extracting the voice of the object from the received image.
  • According to an embodiment, the extracting of the voice of the object may include analyzing the image to obtain a screen in which the face of the object appears and a time at which the face of the object appears, and extracting the voice of the object by detecting the utterance state of the object from the portion of the image corresponding to the acquired time, based on an analysis result including at least one of the movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, and a script of the image.
  • According to an embodiment, the storing of the voice information and the voice of the object may include receiving a modeling result of the voice of the object from the server, and storing the voice information and the voice of the object by reflecting the modeling result received from the server.
  • According to an embodiment, the modeling result may be a modeling result that another device extracted and stored in the server as a result of determining that the voice of the object is likely to be requested by this device or by other devices.
  • According to an embodiment, the method of operating the electronic device may further include determining, based on the acquired information on the object, whether the voice of the object is likely to be requested by another device, and transmitting the stored voice information and the stored voice of the object to the server as a result of determining that the voice of the object is likely to be requested by another device.
  • According to an embodiment, the method of operating the electronic device may further include receiving, from a user of the device, a response as to whether the extracted voice is the same as the voice of the object, and determining, based on the received response, whether the extracted voice is the same as the voice of the object; the storing of the voice information and the voice of the object may be performed as a result of determining that the voices are the same.
  • An electronic device according to an embodiment includes a communication unit, a memory storing one or more instructions, and at least one processor that executes the one or more instructions stored in the memory. By executing the one or more instructions, the at least one processor may obtain information on an object whose voice is to be stored, extract the voice of the object by detecting the utterance state of the object from an image including the object based on the obtained information on the object, store the voice of the object and voice information related to the voice of the object as a result of modeling the extracted voice, and provide a voice service based on the stored voice information.
  • A computer-readable recording medium according to another embodiment stores one or more programs that, when executed by one or more processors of an electronic device, cause the electronic device to obtain information on an object whose voice is to be stored; extract the voice of the object by detecting the utterance state of the object from an image including the object, based on the obtained information on the object; store, as a result of modeling the extracted voice, the voice of the object and voice information related to the voice of the object; and provide a voice service based on the stored voice information.
  • The expression "at least one of" modifies the entire list of elements that follows it and does not modify the individual elements of the list. For example, "at least one of A, B, and C" means only A, only B, only C, both A and B, both B and C, both A and C, all of A, B, and C, or any combination thereof.
  • first and second may be used to describe various elements, but the elements are not limited by the terms. The terms are used only for the purpose of distinguishing one component from other components. For example, without departing from the scope of the present disclosure, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element.
  • The term "and/or" includes any combination of a plurality of related items or any one of the plurality of related items.
  • The term "unit" used in the present disclosure means a software or hardware component such as an FPGA or an ASIC, and a "unit" performs certain roles.
  • However, a "unit" is not limited to software or hardware.
  • A "unit" may be configured to reside in an addressable storage medium or may be configured to operate one or more processors.
  • Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
  • The functions provided within the components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units".
  • In the present disclosure, the voice service may include a speech synthesis service, a text-to-speech (TTS) service, a biometric voice service, a voice service for the visually impaired, a virtual assistant's voice service, a voice guide service, a voiceprint service, and the like, and may mean any service provided using voice.
  • An embodiment of the present disclosure may be represented by functional block configurations and various processing steps. Some or all of these functional blocks may be implemented with various numbers of hardware and/or software components that perform specific functions.
  • the functional blocks of the present disclosure may be implemented by one or more microprocessors, or may be implemented by circuit configurations for a predetermined function.
  • the functional blocks of the present disclosure may be implemented in various programming or scripting languages. Functional blocks may be implemented as an algorithm executed on one or more processors.
  • the present disclosure may employ conventional techniques for electronic environment setting, signal processing, and/or data processing.
  • FIG. 1 is a schematic diagram illustrating a method of providing a voice service by extracting voice from an image reproduced in an electronic device according to an exemplary embodiment.
  • The electronic device 100 may obtain, from the user, information on an object 120 whose voice is to be stored.
  • In an embodiment, the object 120 represents a subject speaking in the image 110.
  • The electronic device 100 may extract the voice of the object 120 by detecting the utterance state of the object 120 from the image 110 including the object 120, based on the information on the object 120 whose voice is to be stored.
  • In addition, the electronic device 100 may acquire and store the voice of the object 120 and voice information related to the voice of the object 120.
  • the electronic device 100 may provide a voice service to a user based on the stored voice information.
  • the electronic device 100 may receive a user's voice through a microphone, and the microphone may convert the received voice into a voice signal.
  • In an embodiment, the electronic device 100 may convert the speech into computer-readable text by performing automatic speech recognition (ASR) on the voice signal. That is, the electronic device 100 may perform a speech-to-text (STT) function.
  • In an embodiment, the electronic device 100 may acquire, from the computer-readable text, information on the object whose voice is to be stored. Based on the acquired object information, the electronic device 100 may extract the voice of the object by detecting the utterance state of the object from the image containing the object and, as a result of modeling the extracted voice, may store the voice of the object and voice information related to the voice of the object.
  • the electronic device 100 may perform a text to speech (TTS) function to provide a voice service for outputting an output text determined based on stored voice information as a voice of an object.
  • the electronic device 100 may include a processor that performs a speech to text (STT) function and a processor that performs a text to speech (TTS) function.
  • the electronic device 100 may include a processor that performs both an STT function and a TTS function.
  • the server may include a Speech to Text (STT) server and a Text to Speech (TTS) server, and may include one server capable of performing both the STT function and the TTS function.
  • the electronic device 100 may receive an input message from the user saying "Save Baek Jong-won's voice".
  • In an embodiment, the input message may include "Baek Jong-won", corresponding to the name of the object 120, as the information on the object 120 whose voice is to be stored. That is, the electronic device 100 may obtain the name of the object 120, as the information on the object 120 whose voice is to be stored, from the input message received from the user.
  • the electronic device 100 may detect whether the object 120 is in the speech state from the image 110 based on the name of the object 120.
  • For example, the electronic device 100 may detect whether the object 120 is in the utterance state in consideration of the face of the object 120, the movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, the presence or absence of a caption in the image 110, a script corresponding to the image 110, and the like; however, the factors considered for detecting whether the object 120 is in the utterance state are not limited thereto.
  • In an embodiment, the electronic device 100 may extract the voice of the object 120 in an utterance section in which the object 120 is detected to be in the utterance state. For example, after starting the analysis of the image 110, the electronic device may capture the screen every second for a predetermined period of time and detect whether the object 120 is in the utterance state by analyzing whether the mouth of the object 120 on the screen is open, the shape of the lips, the presence of subtitles and visible teeth, a script, and the like. Accordingly, the electronic device 100 may extract the voice of the object 120 in the utterance section in which the object is detected to be in the utterance state; for example, the electronic device 100 may record the voice for one second from the time detected as the utterance state.
  • However, the method of detecting the utterance state and the method of recording the voice are not limited thereto: the electronic device 100 may capture the screen at intervals shorter or longer than one second, may record the voice for a period shorter or longer than one second, and, instead of detecting the utterance state only for a predetermined time, may keep detecting the utterance state and recording the corresponding voice until a text-to-speech (TTS) voice can be generated from the extracted voice.
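  • As a rough illustration of the per-second capture-and-detect loop described above, the following Python sketch samples one frame per second and flags an utterance when the mouth opening computed from face landmarks exceeds a threshold. This is a minimal sketch, not the patented method: get_face_landmarks stands in for any landmark detector, and the landmark keys and the 0.3 threshold are assumptions.

```python
import cv2  # OpenCV, used here only to decode the video

def mouth_open_ratio(landmarks):
    """Rough mouth-aspect ratio: lip gap divided by mouth width.
    `landmarks` is assumed to be a dict of (x, y) points."""
    gap = abs(landmarks["upper_lip"][1] - landmarks["lower_lip"][1])
    width = abs(landmarks["mouth_left"][0] - landmarks["mouth_right"][0])
    return gap / max(width, 1e-6)

def detect_utterance_seconds(video_path, get_face_landmarks, threshold=0.3):
    """Capture one screen per second and return the seconds at which the
    object's mouth appears open (a proxy for the utterance state)."""
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30
    utterance_seconds, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % fps == 0:               # one capture per second
            lm = get_face_landmarks(frame)     # hypothetical landmark detector
            if lm and mouth_open_ratio(lm) > threshold:
                utterance_seconds.append(frame_idx // fps)
        frame_idx += 1
    cap.release()
    return utterance_seconds
```

The seconds returned by such a loop would then drive the one-second voice recordings described above.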
  • In an embodiment, the electronic device 100 may model the voice according to the utterance state, map it to the object, and store the voice of the object 120 together with voice information related to the voice of the object 120. That is, the electronic device 100 may model and merge the captured screens, the times detected as the utterance state, and the extracted voice.
  • In an embodiment, the voice information related to the voice of the object 120 may include the dialect, intonation, buzzwords, tone, pronunciation, parodies, dialogue, and the like of the voice of the object 120, but is not limited thereto.
  • For example, when the electronic device 100 extracts the voice of the object 120 saying "Did you know?" in an utterance section of the object 120 in the image 110, the electronic device 100 may acquire voice information indicating that the object 120 uses the dialect ending "-yu" instead of the standard ending "-yo", and may store the voice information together with the voice of the object 120.
  • In an embodiment, the electronic device 100 may provide the voice service 130 based on the stored voice information. For example, the user may request an alarm from the electronic device 100 in the voice of the object 120, and the electronic device 100 may deliver the alarm in the voice of the object 120 at a time set by the user. In providing the voice service 130, the electronic device 100 may take the voice information related to the voice of the object 120 into account: for example, instead of "Wake up", the electronic device 100 may provide a voice service that outputs "Wake up?", an output text determined based on the voice information, in the voice of the object.
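  • As a minimal sketch of how such stored voice information might shape the output text before synthesis, assume the speaking style is kept as simple ending-replacement rules; the rule table and function name below are illustrative, not from the patent.

```python
# Hypothetical record stored alongside the object's voice model.
voice_info = {
    "ending_rules": [("~yo", "~yu")],  # dialect ending observed in the image
}

def apply_voice_info(target_text: str, info: dict) -> str:
    """Rewrite the target text according to the stored voice information
    so that the TTS output matches the object's speaking style."""
    out = target_text
    for standard, dialect in info["ending_rules"]:
        out = out.replace(standard, dialect)
    return out

# A target text ending in "~yo" is rewritten to end in "~yu"
# before being handed to the TTS engine.
print(apply_voice_info("Wake up~yo", voice_info))  # -> "Wake up~yu"
```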
  • In an embodiment, the electronic device 100 may be a smart phone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a speaker, a laptop, a media player, a micro server, an e-book terminal, a digital broadcasting terminal, a kiosk, an MP3 player, a digital camera, a robot cleaner, a home appliance, or another mobile or non-mobile computing device, but is not limited thereto.
  • the electronic device 100 may be a wearable device such as a watch, glasses, a hair band, and a ring having a communication function and a data processing function.
  • FIG. 2 is a flowchart illustrating a method of providing a voice service by extracting a voice of an object from an image, according to an exemplary embodiment.
  • the electronic device may acquire information on an object to store voice.
  • For example, the electronic device may receive, as the information on the object whose voice is to be stored, the object's name, photograph, trade name, portrait, stage name, pen name, abbreviation, or the like from the user.
  • In an embodiment, the electronic device may acquire an image including the object based on the information received from the user. For example, the electronic device may acquire, as the image including the object, an image being played on the electronic device, an image being played on an external device capable of communicating with the electronic device, or an image stored in the external device, but is not limited thereto.
  • For example, the external device capable of communicating with the electronic device may be a server, a smart phone, a tablet PC, a PC, a TV, a smart TV, a mobile phone, a personal digital assistant (PDA), a speaker, a laptop, a media player, a micro server, or the like.
  • the electronic device may detect whether the object is in the speech state from the image including the object, based on the acquired information on the object.
  • the electronic device may obtain a screen on which the face of the object appears and a time when the face of the object appears by analyzing an image including the object.
  • the voice of the object corresponding to the time when the face of the object appears may be extracted.
  • the electronic device may detect whether an object is in a speech state from an image in consideration of movement of the lips of the object, the shape of the mouth of the object, recognition of the teeth of the object, the presence of a caption in the image, and a script of the image.
  • an image including an object may include people other than the object.
  • the electronic device may recognize an object to be extracted from an image including a plurality of people based on information on the object, for example, using face recognition technology.
  • the electronic device may detect whether an object recognized among a plurality of people is in a speech state.
  • the electronic device may extract the voice of the object in the speech state section in which the object is detected as the speech state.
  • For example, after receiving a request from the user to extract the voice of the object, the electronic device may detect whether the object is in the utterance state for a certain period of time and extract the voice of the object in the utterance section in which the object is in the utterance state.
  • For example, the electronic device may start analyzing the image, capture a screen every second for a certain period of time, and detect whether the object is in the utterance state by analyzing the captured screens.
  • the electronic device may extract the voice of the object in the speech state section in which the object is detected as the speech state.
  • For example, the electronic device may record the voice of the object for one second from a point determined to be in the utterance state as a result of capturing a screen every second.
  • For another example, the electronic device may record the voice of the object from the point determined to be in the utterance state until the point determined to be no longer in the utterance state.
  • For another example, the electronic device may keep extracting the voice of the object until a text-to-speech (TTS) voice can be generated from the extracted voice.
  • In an embodiment, the electronic device may extract only the voice of the object by separating it from sounds other than the voice, such as background music or noise, in the utterance section in which the object is detected to be in the utterance state.
  • For example, the electronic device may remove background music and noise by filtering out components outside the frequency band in which the human voice is mainly located, or by overlaying the background music onto the signal in reverse phase to cancel it. However, the method of removing background music or noise is not limited thereto, and other conventional techniques may be used.
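  • As a rough illustration of the frequency-band approach, the sketch below band-passes a mono signal to roughly the 300-3400 Hz range where speech energy is concentrated. The cutoff values and filter order are assumptions; the patent does not prescribe a particular filter.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def keep_voice_band(audio: np.ndarray, sample_rate: int,
                    low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Attenuate background music and noise outside the typical speech band
    by applying a Butterworth band-pass filter."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass",
                 fs=sample_rate, output="sos")
    return sosfilt(sos, audio)
```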
  • the electronic device may store voice information related to the voice of the object and the voice of the object.
  • In an embodiment, the electronic device may model the screens of the utterance section detected in operation 230 together with the voice extracted in operation 250. That is, the electronic device may model the voice of the object by correlating the section detected as the object's utterance state with the extracted voice of the object.
  • the electronic device may store voice information related to the voice of the object in addition to the modeled voice.
  • For example, the voice information may be the dialect, intonation, buzzwords, tone, pronunciation, parodies, dialogue, or the like of the voice of the object, but is not limited thereto.
  • In an embodiment, the electronic device may provide a voice service based on the stored voice information. For example, when the electronic device receives from the user a notification request, a book-reading request using text-to-speech (TTS), a virtual assistant request using the object's voice, or the like, the electronic device may provide the user with the voice service corresponding to the request.
  • In an embodiment, the electronic device may provide a voice service that outputs an output text, determined based on the stored voice information, in the voice of the object. For example, the electronic device may determine, from the target text "Hello?", an output text based on the voice information, and output the determined output text in the voice of the object.
  • the electronic device may obtain an image including an object from an image being played in the electronic device, an image being played in an external device communicating with the electronic device, an image stored in the external device, etc., based on the information on the object.
  • For example, the external device capable of communicating with the electronic device may include a server, a smart phone, a tablet PC, a PC, a TV, a smart TV, a mobile phone, a personal digital assistant (PDA), a speaker, a laptop, a media player, a micro server, an e-book terminal, a digital broadcasting terminal, a kiosk, an MP3 player, a digital camera, a robot vacuum cleaner, a home appliance, other mobile or non-mobile computing devices, and wearable devices such as watches, glasses, hair bands, and rings having communication and data processing functions, but is not limited thereto.
  • In an embodiment, the electronic device may acquire an image including a plurality of people, recognize the object in the acquired image based on the acquired information on the object, and extract the voice of the object by detecting the utterance state of the recognized object in the acquired image.
  • the electronic device may provide a voice service for outputting an output text determined based on stored voice information as a voice of an object.
  • In an embodiment, the electronic device may receive an image including the object from the server based on the acquired information on the object, play the image received from the server at double speed, detect the utterance state of the object, and thereby extract the voice of the object at double speed.
  • In an embodiment, the electronic device may analyze the image to obtain a screen in which the face of the object appears and a time at which the face of the object appears, and may extract the voice of the object by detecting the utterance state of the object from the portion of the image corresponding to the acquired time, based on the movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, and the like.
  • the electronic device may receive a modeling result of the object's voice from the server and reflect the modeling result received from the server to store voice information and the object's voice.
  • In an embodiment, the modeling result may be a modeling result that another device extracted and stored in the server as a result of determining that the voice of the object is likely to be requested by this device or by other devices.
  • In an embodiment, the electronic device may determine, based on the acquired information on the object, whether the voice of the object is likely to be requested by another device, and, as a result of determining that it is, may transmit the stored voice information and the stored voice of the object to the server.
  • In an embodiment, the electronic device may receive from the user a response as to whether the extracted voice is the same as the voice of the object, determine, based on the received response, whether the extracted voice is the same as the voice of the object, and store the voice information and the voice of the object as a result of determining that they are the same.
  • 3A is a flowchart illustrating a method of providing a voice service by acquiring an image from an external device according to an exemplary embodiment.
  • 3B is an exemplary diagram of a method of providing a voice service by acquiring an image from an external device according to an exemplary embodiment.
  • the present disclosure describes a method of providing a voice service by acquiring an image from an external device according to an exemplary embodiment.
  • the electronic device 100 may obtain information 310 about an object to store a voice from a user.
  • the electronic device 100 may obtain information about a type of a voice service desired to be provided by a user in addition to the information 310 about the object.
  • For example, the electronic device 100 may acquire the object information 310 and the voice service type at the same time, may acquire the object information 310 first and then the voice service type, or may acquire the voice service type first and then the information 310 on the object.
  • For example, the virtual assistant of the electronic device 100 may recognize the user's voice saying "Read a book in BTS Jin's voice".
  • Accordingly, the electronic device 100 may obtain, from the recognized voice, the information 310 on an object called "BTS Jin" and a voice service type of "reading a book". That is, the information 310 on the object may be the name of the object 340, but is not limited thereto, and may be a photograph, trade name, portrait, stage name, pseudonym, or the like of the object 340.
  • the electronic device 100 may receive an image 330 from the external device 320.
  • the electronic device 100 may receive an image 330 from an external device 320 that communicates with the electronic device 100 by wire or wirelessly.
  • For example, the electronic device 100 may transmit and receive the information 310 on the object, the image 350, audio data, and the like to and from the external device 320 using an output port or wireless communication. That is, the electronic device 100 may receive the image 330 including the object from the external device 320 and extract the voice of the object 340 from the image 330 received by the electronic device 100.
  • For another example, the external device 320 may receive the information 310 on the object from the electronic device 100, extract the voice of the object 340 based on the information 310 on the object, and transmit to the electronic device 100 the voice data in which the voice of the object 340 was extracted by the external device 320.
  • In an embodiment, the external device 320 may transmit and receive the information 310 on the object, the image 350, audio data, and the like to and from the electronic device 100 using an output port or wireless communication. That is, the electronic device 100 may transmit the information 310 on the object to the external device 320, and the external device 320 may extract the voice of the object 340 from the image 330 being played on the external device 320 and transmit the voice data to the electronic device 100.
  • the output port may be HDMI, DP, or Thunderbolt for simultaneously transmitting video/audio signals, or may include ports for separately outputting a video signal and an audio signal.
  • For example, the wireless communication may include short-range communication such as Bluetooth communication, Bluetooth Low Energy (BLE) communication, near field communication (NFC), WLAN (Wi-Fi) communication, Zigbee communication, IrDA (Infrared Data Association) communication, WFD (Wi-Fi Direct) communication, UWB (ultra wideband) communication, and Ant+ communication, as well as mobile communication that transmits and receives wireless signals with at least one of a base station, an external terminal, and a server on a mobile communication network, and reception through a broadcast channel.
  • In an embodiment, the external device 320 may be a server, a smart phone, a tablet PC, a PC, a TV, a smart TV, a mobile phone, a personal digital assistant (PDA), a speaker, a laptop, a media player, a micro server, an e-book terminal, or the like.
  • For example, the electronic device 100 may be a smartphone and the external device 320 may be a TV. In this case, the electronic device 100 may transmit data to and receive data from the TV, which is the external device 320, or control the TV using Wi-Fi/BT or infrared communication. An application for controlling the TV may be installed on the smartphone, and the TV may also be controlled in other ways.
  • the electronic device 100 may detect whether the object 340 is in the speech state from the image 350 received from the external device 320, based on the information 310 on the object. In an embodiment, the electronic device 100 may analyze the received image 350 to obtain a screen on which the face of the object 340 appears and a time when the face of the object 340 appears. Also, a voice of an object corresponding to a time when the face of the object 340 appears may be extracted.
  • For example, the electronic device 100 may detect whether the object 340 is in the utterance state from the image 350 in consideration of the movement of the lips of the object 340, the shape of the mouth of the object 340, recognition of the teeth of the object 340, the presence or absence of a caption in the image 350, a script of the image 350, and the like.
  • the image 350 including the object may include people other than the object 340.
  • For example, the electronic device 100 may recognize the object 340 whose voice is to be extracted from an image including a plurality of people, based on the information 310 on the object, using face recognition technology, for example. Also, the electronic device 100 may detect whether the recognized object 340 among the plurality of people is in the utterance state.
  • In an embodiment, the electronic device 100 may acquire N images by capturing the image 350 including the object at predetermined intervals. By determining whether the object 340 is included in each of the N images, the electronic device 100 may determine M images including the object among the N images. The electronic device 100 may then determine, from the M images including the object 340, K utterance intervals in which the object 340 is detected to be in the utterance state.
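  • A compact sketch of this N-to-M-to-K narrowing, assuming two black-box predicates: contains_object for face matching against the stored object information and is_uttering for the mouth/teeth/caption analysis. Both names are hypothetical.

```python
from typing import Callable, List, Tuple

def narrow_to_utterance_intervals(
    frames: List[Tuple[int, object]],            # (second, captured screen), N entries
    contains_object: Callable[[object], bool],   # face-recognition check
    is_uttering: Callable[[object], bool],       # mouth/teeth/caption check
) -> List[Tuple[int, int]]:
    """N captured screens -> M screens showing the object -> K utterance intervals."""
    m_frames = [(t, f) for t, f in frames if contains_object(f)]  # M screens
    k_times = sorted(t for t, f in m_frames if is_uttering(f))
    intervals: List[Tuple[int, int]] = []
    for t in k_times:
        # Merge timestamps that are one capture apart into [start, end] intervals.
        if intervals and t - intervals[-1][1] <= 1:
            intervals[-1] = (intervals[-1][0], t)
        else:
            intervals.append((t, t))
    return intervals
```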
  • the electronic device 100 extracts the voice of the object 340 from the received image 330 and stores voice information related to the voice of the object and the voice of the object.
  • the electronic device 100 may acquire the voice of the object 340 in a section corresponding to a time when K images are captured.
  • In an embodiment, the electronic device 100 may store the voice of the object 340 and voice information related to the voice of the object 340.
  • For example, the electronic device 100 may itself extract the voice of the object 340 and store both the voice and the related voice information.
  • For another example, the electronic device 100 may obtain the voice of the object 340 and the voice information related to the voice of the object 340 from the external device 320.
  • For another example, the electronic device 100 may receive only the voice of the object 340 from the external device 320, extract the voice information from the received voice, and store both the voice and the voice information of the object 340.
  • the electronic device 100 may provide a voice service based on the stored voice information.
  • For example, based on the user's request for a book-reading service, the electronic device 100 may provide the user with a service that reads a book in the voice of the object 340.
  • 3C is an exemplary diagram of a method for providing a voice service by performing voice recognition by a display device according to an exemplary embodiment.
  • the display device 360 may receive a voice command from a user.
  • the display device 360 may directly receive a voice command from a user.
  • the remote controller 390 may receive a voice command from a user, and the display device 360 may receive a voice command from the remote controller 390.
  • the display device 360 may receive a user's voice command through an internal microphone of the display device 360.
  • For example, the display device 360 may perform remote voice recognition on the user's voice through the internal microphone of the display device 360. For example, while watching the image 370 including the object 380 on the display device 360, the user may say to the display device 360, "Save BTS Jin's voice".
  • the display device 360 may obtain information on an object to store the voice by performing remote voice recognition on the user's voice.
  • the display device 360 may extract and model the voice of the object by detecting the speech state of the object from the image including the object, based on the information on the object. Accordingly, the display device 360 may store the voice of the object 380 and voice information related to the voice of the object.
  • the internal microphone of the remote controller 390 may receive a user's voice command.
  • In an embodiment, the display device 360 is a device capable of wireless or wired communication with the remote controller 390, and may receive the user's voice command from the remote controller 390. Accordingly, the display device 360 may store the voice and voice information of the object 380 based on the user's voice command. For example, while watching the image 370 including the object 380 on the display device 360, the user may say to the remote controller 390, "Save BTS Jin's voice". The remote controller 390 may perform voice recognition on the user's voice and transmit the user's voice command to the display device 360.
  • the display device 360 may acquire information on an object to store the voice based on the received voice command.
  • the display device 360 may extract and model the voice of the object by detecting the speech state of the object from the image including the object, based on the information on the object. Accordingly, the display device 360 may store the voice of the object 380 and voice information related to the voice of the object.
  • 4A and 4B are exemplary diagrams of a method of providing a voice service by searching for an image of an object from a server according to an exemplary embodiment.
  • the electronic device 100 may obtain information 410 about an object to store a voice from a user.
  • the user inputs a voice to the virtual assistant of the electronic device 100, saying "Save Baek Jong-won's voice," and the electronic device 100 receives information 410 about an object called "Baek Jong-won" from the input voice.
  • operation 401 may correspond to operation 301 described above with reference to FIG. 3.
  • the electronic device 100 may receive the voice data 430 from the server 420.
  • For example, the electronic device 100 may receive an image from the server 420 and extract the voice of the object itself, or, after the server 420 models the voice of the object, the electronic device 100 may receive the modeling result from the server.
  • For another example, the electronic device 100 may receive the voice and voice information of the object from the server 420.
  • However, the type of data that the electronic device 100 receives from the server 420 is not limited thereto.
  • In an embodiment, when the electronic device 100 receives from the user a command to extract the voice of an object, the electronic device 100 may search a video sharing site, a portal site, or the like for the information 410 on the object, obtain an image related to the object, and obtain the voice data 430 of the object from the image related to the object.
  • For another example, the server may acquire an image of the object, extract the object's voice, and transmit it to the electronic device, so that the electronic device obtains the object's voice data 430 from the server.
  • That is, even when the user does not know exactly from which image the voice of the object is extracted, the electronic device 100 may automatically extract the voice of the object in the background and acquire and store the voice information.
  • In an embodiment, the electronic device 100 may receive an image including the object from the server 420 based on the information 410 on the object, play the image received from the server at double speed, detect the utterance state of the object, and extract the voice of the object at double speed according to the detected utterance state.
  • the electronic device 100 or the server 420 may quickly acquire a voice of an object and voice information related to the voice of the object by reproducing an image at double speed.
  • the electronic device 100 may store voice information related to the voice of the object and the voice of the object based on the received voice data 430.
  • the voice data 430 may include both the voice of the object and voice information related to the voice of the object.
  • For another example, the voice data 430 may include only the voice of the object, and the electronic device 100 may obtain the voice information related to the voice of the object based on the voice data 430 received by the electronic device 100.
  • For another example, the electronic device 100 may receive only an image including the object, and may extract and store, from the received image, the voice of the object and the voice information related to the voice of the object.
  • In an embodiment, the electronic device 100 may provide a voice service based on the stored voice information. For example, based on the user's request, the electronic device 100 may provide the user with a schedule reminder in the voice of the object.
  • the electronic device 100 may pre-extract the voice of the object by scanning the image without reproducing the image. According to an embodiment, the electronic device 100 may quickly extract a voice of an object by using data stored in an internal memory or an external memory.
  • FIG. 5 is an exemplary diagram of a method of receiving voice data extracted from an external device from a server according to an exemplary embodiment.
  • the external device 530 may extract and store the voice of the object 550. Also, when it is determined that there is a possibility that the voice of the object may be requested by another device, the external device 530 may transmit the voice of the object to the server 520. In addition, the server 520 may receive and store the voice of the object 550 detected by the external device 530 from the external device 530. In an embodiment, the electronic device 100 may receive a voice of an object extracted by the external device 530 from the server 520. For example, the electronic device 100 may receive a modeling result of the voice of the object from the server 520 and store voice and voice information of the object based on the received modeling result. For another example, the electronic device 100 may receive and store voice and voice information of an object from the server 520.
  • the external device 530 may extract the voice of the object 550 and transmit the voice data 570 to the server 520.
  • the external device 530 may acquire information on an object by a user of the external device 530.
  • the external device 530 may extract the voice of the object 550 and voice information related to the voice of the object 550 from the image 540 based on the acquired information about the object.
  • the external device 530 may transmit the extracted voice and voice information of the object to the server 520.
  • the electronic device 100 may obtain information 510 about an object 550 to store a voice from a user. For example, the user inputs a voice to the virtual assistant of the electronic device 100, saying "Save Baek Jong-won's voice," and the electronic device 100 receives information 510 about an object called "Baek Jong-won" from the input voice. Can be obtained.
  • the electronic device 100 may receive voice information of the object 550 and voice 580 from the server 520. Also, the server 520 may transmit voice information and voice 580 to the electronic device 100 based on the voice data 570 extracted by the external device 530.
  • the electronic device 100 may receive and store voice information and voice 580 of an object from the server 520.
  • For example, when the object is a famous person, the operation of extracting the voice of the object redundantly on several devices can be omitted: the voice of the object extracted by one device is stored in the server, and the stored voice can be used by other devices. Accordingly, the time for a device to acquire the voice is saved, the accuracy of the extracted voice is improved, and the quality of the voice service can be improved.
  • the electronic device 100 may provide a voice service to a user by using voice information and voice 580 of an object acquired from the server 520.
  • the electronic device 100 may provide a voice service for outputting an output text determined based on stored voice information as a voice of an object.
  • For example, the user may ask the virtual assistant to "read the recipe for Alio Olio pasta in Baek Jong-won's voice".
  • Accordingly, the electronic device 100 may output information about the Alio Olio pasta recipe in Baek Jong-won's voice based on the stored voice information and the voice 580.
  • In an embodiment, the virtual assistant may determine, from the target text "I'll start making Alio Olio pasta", the output text "I'll make Alio Olio pasta" based on the stored voice information. Accordingly, the virtual assistant may provide the determined output text "I'll make Alio Olio pasta" to the user in Baek Jong-won's voice.
  • In an embodiment, the external device 530 may transmit to the server 520 the image 540 itself including the object, rather than the voice data 570. Accordingly, the server 520 may acquire and store, from the received image 540, the voice of the object and the voice information related to the voice of the object, and may transmit the stored voice information and voice of the object to the external device 530 and the electronic device 100.
  • FIG. 6 is a flowchart illustrating a method of generating a text-to-speech (TTS) voice by modeling a screen and a voice from an image according to an exemplary embodiment.
  • the electronic device may acquire information on an object to store voice.
  • For example, the electronic device may receive, from the user, the object's name, photograph, trade name, portrait, stage name, pseudonym, abbreviation, or the like as the information on the object whose voice is to be stored, but the information on the object is not limited thereto.
  • the electronic device may acquire an image including the object.
  • the electronic device may acquire an image including an object based on information received from a user. For example, as an image including an object, the electronic device may acquire an image being played on the device, an image being played on an external device capable of communicating with the device, an image received from a server, and the like, but is not limited thereto.
  • In an embodiment, the electronic device may capture a screen at specific intervals and record the voice at specific intervals. For example, when the electronic device detects the voice for 1 minute, it may capture a screen every second to obtain 60 screens, and record the voice every second to obtain 60 voice models. However, this is only an example: the electronic device may detect the voice for a time longer or shorter than 1 minute, and may keep detecting the voice until the extracted voice of the object can be used for TTS. In addition, the electronic device may capture a screen at intervals longer or shorter than 1 second, and may record the voice at intervals longer or shorter than 1 second. That is, the electronic device may acquire, from the image including the object, N screens, N voice models, and the image times at which the N screens and voice models were acquired.
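  • A minimal sketch of this paired sampling, assuming the frames have already been decoded into a list and the audio track loaded as a mono numpy array (the decoding helpers are out of scope here):

```python
import numpy as np

def sample_screens_and_voice(frames, frames_per_sec, audio, audio_rate, seconds=60):
    """Pair one captured screen per second with the matching one-second audio
    chunk ("voice model"), as in the 60-screens/60-models example above."""
    screens, voice_models, times = [], [], []
    for s in range(seconds):
        frame_idx = s * frames_per_sec
        chunk = audio[s * audio_rate:(s + 1) * audio_rate]
        if frame_idx >= len(frames) or chunk.size == 0:
            break                                # image shorter than `seconds`
        screens.append(frames[frame_idx])        # N screens
        voice_models.append(chunk)               # N one-second voice models
        times.append(s)                          # image time of each pair
    return screens, voice_models, times
```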
  • the electronic device may analyze a face to detect whether the object is in a speech state, based on the N images acquired in operation 630.
  • For example, the face of the object may be analyzed, and M images in which the object exists may be determined from the N images. That is, the electronic device may acquire N images from an image including a plurality of people based on the information on the object acquired in operation 610, recognize the object among the plurality of people in the N images, and determine the M images including the object.
  • the electronic device may store the N voice models acquired in operation 640.
  • the stored N voice models may be used by the device to model the voice corresponding to the speech state of the object in operation 680.
  • the electronic device may detect whether the object is in the speech state. In an embodiment, it is possible to detect whether an object is in a speech state from the M images determined in operation 650.
  • For example, the electronic device may determine whether the object is in the utterance state from the image in consideration of the movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, whether a caption exists in the image, a script of the image, and the like.
  • In an embodiment, the electronic device may determine, from the M images, K images in which the object is determined to be in the utterance state, and store the times at which the K images were captured.
  • In an embodiment, the electronic device may perform voice modeling corresponding to the utterance state by using, from among the N voice models stored in operation 660, the K voice models corresponding to the K images acquired in operation 670. For example, if the K images determined to be in the utterance state were captured 2 to 5 seconds and 10 to 19 seconds after the image started playing, the electronic device may perform the voice modeling using the voice models recorded from 2 to 6 seconds and from 10 to 20 seconds among the N voice models. However, this is only an example, and the voice modeling method is not limited thereto.
  • In an embodiment, the electronic device may additionally consider a script, captions, and the like corresponding to the image including the object. For example, the electronic device may receive a script corresponding to the image from the server and consider the script when determining the utterance time of the object, may read the subtitles of the video using OCR when determining the utterance time, or may convert the voice in the video to text and consider that text when determining the utterance time of the object; however, the additional factors considered in the process of modeling the voice corresponding to the utterance state of the object are not limited thereto.
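  • To make the interval bookkeeping concrete, the sketch below collects the one-second voice models whose timestamps fall inside the detected utterance intervals, extending each interval by one second as in the example above (screens captured at 2 to 5 seconds use audio recorded at 2 to 6 seconds). The data layout is an assumption:

```python
def select_voice_models(voice_models, k_intervals, pad_sec=1):
    """voice_models: list whose index s holds the audio recorded during
    second s; k_intervals: [(start_sec, end_sec), ...] from the utterance
    detection step. Returns the chunks used for voice modeling."""
    selected = []
    for start, end in k_intervals:
        last = min(end + pad_sec, len(voice_models) - 1)
        for s in range(start, last + 1):
            selected.append(voice_models[s])
    return selected

# Intervals detected at 2-5 s and 10-19 s select the audio recorded
# at 2-6 s and 10-20 s, matching the example in the text.
chunks = select_voice_models(list(range(60)), [(2, 5), (10, 19)])
```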
  • the electronic device may generate a text-to-speech (TTS) voice based on the voice modeling performed in operation 680.
  • the electronic device may provide various voice services to users by using the generated TTS voice.
  • some or all of the operations 610 to 690 are performed by a server other than an electronic device, and the electronic device may receive and use a result of the operation or data from the server.
  • the server may include a Speech to Text (STT) server and a Text to Speech (TTS) server, and may include one server capable of performing both the STT function and the TTS function.
  • The STT function is a speech recognition function that converts spoken language into text data, and the TTS function is a speech synthesis function that generates the sound waves of speech corresponding to a given text.
  • FIG. 7 is an exemplary diagram of a method of detecting an utterance state of an object in consideration of additional information of an image according to an embodiment.
  • the electronic device may detect a speech state by analyzing faces of objects included in a plurality of images.
  • For example, the first image 710 may include a first caption 715, the second image 720 and the third image 730 may include no caption, and the fourth image 740 may include a second caption 745. In addition, the mouth of the object may be closed in the first image 710 and the third image 730, and open in the second image 720 and the fourth image 740.
  • the electronic device may recognize that the object in the first image 710 is not in an utterance state.
  • the electronic device may determine that the object in the first image 710 is not in an utterance state by analyzing the shape of the object's mouth, the movement of the lips, and the degree of exposure of the teeth.
  • the electronic device may determine that the object in the second image 720 is in an utterance state.
  • by analyzing the shape of the object's mouth, the movement of the lips, and the degree of tooth exposure, the electronic device may determine that the object in the second image 720 is in an utterance state because its mouth shape is an open mouth shape.
  • the electronic device may recognize that the object in the third image 730 is not in an utterance state.
  • the electronic device may determine that the object in the third image 730 is not in an utterance state in consideration of the shape of the object's mouth, the movement of the lips, the degree of exposure of the teeth, and the fact that no caption exists in the image.
  • the electronic device may recognize that the object in the fourth image 740 is in an utterance state.
  • the electronic device may determine that the object in the fourth image 740 is in an utterance state in consideration of the shape of the object's mouth, the movement of the lips, the degree of tooth exposure, and the presence of the second caption 745 in the image (a combined per-frame sketch follows below).
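The FIG. 7 walkthrough can be condensed into a tiny per-frame rule. The following sketch combines the mouth-shape and caption cues with hypothetical weights chosen only to reproduce the four cases above; the embodiment does not specify a weighting.

```python
from dataclasses import dataclass

@dataclass
class FrameFeatures:
    mouth_open: bool    # from lip/teeth analysis (e.g., a face-landmark detector)
    lips_moving: bool   # change of lip landmarks versus neighboring frames
    has_caption: bool   # e.g., OCR found subtitle text in the frame

def is_uttering(f: FrameFeatures) -> bool:
    """Vote on the utterance state of a frame; weights are illustrative."""
    score = 2 if f.mouth_open else -2       # the dominant cue in FIG. 7
    score += 1 if f.lips_moving else 0
    score += 1 if f.has_caption else -1     # a missing caption counts against utterance
    return score > 0

# FIG. 7 cases: 710 closed mouth + caption -> False; 720 open mouth -> True;
# 730 closed mouth, no caption -> False; 740 open mouth + caption 745 -> True.
print(is_uttering(FrameFeatures(False, False, True)))   # first image: False
print(is_uttering(FrameFeatures(True, True, True)))     # fourth image: True
```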
  • the electronic device may additionally consider a script corresponding to the image including the object, a script included in EPG information, and the like. For example, the electronic device may receive a script corresponding to the image from a server and consider the script when determining the utterance time of the object, may determine the utterance time by reading the subtitles of the video using OCR, or may convert the voice in the video to text and consider the text when determining the utterance time of the object. However, the additional factors considered in the process of modeling the voice corresponding to the utterance state of the object are not limited thereto.
  • FIG. 8 is an exemplary diagram of a method of extracting and verifying a voice in order to provide a voice service, according to an embodiment.
  • the electronic device may be a speaker 820 that communicates with a user's smartphone 810 through wired or wireless communication (eg, Bluetooth).
  • the user may input a command "Save Baek Jong-won's voice" to the speaker 820 operating as a virtual assistant.
  • the speaker 820 operating as a virtual assistant may receive a user's command and cause the smartphone 810 to extract the voice of the object.
  • the speaker 820 operating as a virtual assistant may receive the voice and voice information extracted by the smartphone 810 and perform a process of checking with the user whether the extracted voice is the voice of the object.
  • the speaker 820 operating as a virtual assistant may provide a voice service to the user by using the extracted voice of the object.
  • the smart phone 810 and the speaker 820 are only examples, and the corresponding process may be performed using an electronic device other than the smart phone 810 and/or the speaker 820.
  • electronic devices include smartphones, tablet PCs, PCs, smart TVs, mobile phones, personal digital assistants (PDAs), laptops, media players, micro servers, global positioning system (GPS) devices, e-book terminals, digital broadcasting terminals, navigation devices, kiosks, MP3 players, digital cameras, home appliances, and other mobile or non-mobile computing devices, but are not limited thereto.
  • the electronic device may be a wearable device having a communication function and a data processing function, such as a watch, glasses, a hair band, or a ring.
  • the electronic device may obtain information on an object to store voice.
  • as information on the object whose voice is to be stored, the electronic device may receive from the user the object's name, photo, title, trade name, portrait, stage name, pen name, abbreviation, or the like.
  • the electronic device may acquire an image including the object based on the information received from the user. For example, as the image including the object, the electronic device may acquire an image being played on the device, an image being played on an external device capable of communicating with the device, an image received from a server, and the like, but is not limited thereto. For example, when the message "Save Baek Jong-won's voice" is received from a user, the electronic device may obtain information on the object corresponding to "Baek Jong-won" from the received message (a minimal parsing sketch follows below).
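For illustration, a trivial way to pull the object name out of such a command is sketched below. The English pattern and the regex are assumptions for this sketch; a deployed assistant would use speech recognition followed by a natural-language-understanding model.

```python
import re
from typing import Optional

def object_from_command(command: str) -> Optional[str]:
    """Return the name in a "Save <name>'s voice" command, or None if absent."""
    m = re.match(r"save\s+(.+?)'s\s+voice", command.strip(), flags=re.IGNORECASE)
    return m.group(1) if m else None

print(object_from_command("Save Baek Jong-won's voice"))  # -> Baek Jong-won
```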
  • the electronic device may start an operation of storing the voice of the object.
  • the electronic device may acquire an image being played on the electronic device, an image being played on an external device communicating with the electronic device, an image received from the server, etc., based on the information on the object.
  • however, the types of images that may be acquired are not limited thereto.
  • the electronic device may detect whether the object is in a speech state from the acquired image, and extract the voice of the object in the section detected as being in the speech state.
  • the electronic device may obtain the voice of the object and voice information related to the voice of the object.
  • the voice information related to the voice of the object may include a dialect, intonation, buzzwords, manner of speaking, pronunciation, parodies, famous lines, and the like for the voice of the object, but is not limited thereto.
  • the electronic device may receive confirmation from the user whether the acquired voice of the object is the voice requested by the user.
  • the electronic device may transmit a message requesting confirmation after reproducing the voice of the object, or may transmit a message requesting confirmation through the voice of the object.
  • the electronic device may output a voice asking "Is this the voice of Baek Jong-won?" as a mechanical sound.
  • alternatively, based on dialect information stored as Baek Jong-won's voice information, the electronic device may output the confirmation question in Baek Jong-won's own voice, using the dialect form of "Is it right?" instead of the standard form (a minimal confirmation sketch follows below).
  • FIG. 9 is an exemplary diagram of a method for a user to store a voice for a specific object from an image including a plurality of people, according to an exemplary embodiment.
  • the electronic device may obtain information on an object from a user.
  • the virtual assistant of the electronic device may receive a voice recognized as "Save BTS Jungkook's voice" from the user.
  • the electronic device may obtain information on the object "BTS Jungkook" from the acquired voice. That is, the information on the object may be the name of the object 906, but is not limited thereto, and may be a photo, title, trade name, portrait, stage name, pen name, abbreviation, or the like of the object 906.
  • the electronic device may acquire the image 900 and analyze it.
  • the electronic device may acquire an image being played on the electronic device, an image being played on an external device capable of communicating with the electronic device, an image received from a server, and the like, based on information about the object.
  • the acquired image 900 may be an image including a plurality of people 902, 904, 906, and 908.
  • the electronic device may determine the object 906 among the plurality of people 902, 904, 906 and 908 by analyzing the acquired image 900.
  • the electronic device may capture a plurality of images including the object 906 from the acquired image 900.
  • the electronic device may acquire N images by capturing the acquired image 900 at regular time intervals, and then determine M images including the object 906 from among the N captured images (a minimal sampling sketch follows below).
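A minimal sketch of this regular-interval capture is shown below, using OpenCV for frame access. The contains_object() recognizer is a hypothetical stand-in for the face/object recognition step, and the sampling period is an assumption.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, every_sec: float, contains_object):
    """Capture frames every `every_sec` seconds, keeping those showing the object."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0           # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_sec)))
    kept, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                                    # end of stream
            break
        if idx % step == 0 and contains_object(frame):
            kept.append((idx / fps, frame))           # (timestamp in seconds, frame)
        idx += 1
    cap.release()
    return kept   # the M images including the object, with their capture times
```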
  • the electronic device may determine a section in which the object is in the utterance state among a plurality of images including the object 906.
  • the electronic device may detect whether an object is in a speech state from an image in consideration of a movement of a lip of an object, a shape of a mouth of an object, recognition of teeth of an object, presence or absence of a caption in an image, a script of an image, and the like.
  • the electronic device may extract the voice of the object in the speech state section in which the object is detected as the speech state. Also, as a result of modeling the extracted voice, the electronic device may obtain voice information related to the voice of the object and the voice of the object.
  • the electronic device may compare the result of modeling the extracted speech with the speech in the database.
  • the accuracy of the extracted voice and the quality of the voice service may be improved by comparing the extracted voice with the voice corresponding to the object in the database.
  • the electronic device may recognize a first person 902, a second person 904, a third person 906, and a fourth person 908 from the image including the plurality of people 902, 904, 906, and 908.
  • because there are a total of four people, the initial probability that the voice extracted in the section corresponding to the time when the image was captured is the voice of the third person 906 corresponding to the object may be 25%.
  • the electronic device may detect whether the third person 906 corresponding to the object is in an utterance state, considering the movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, the presence of a caption in the image, the script of the image, the presence of a microphone, and the like.
  • for example, when the electronic device determines that the mouth shapes of the first person 902 and the third person 906 are open while the second person 904 and the fourth person 908 have closed mouth shapes, the probability that the voice extracted from the section in which the image was captured corresponds to the voice of the third person 906 may rise to 60%.
  • the electronic device may compare the extracted voice with the voices in the database. As a result of this comparison, the probability that the extracted voice corresponds to the object, the third person 906, may rise to 90% (an illustrative scoring sketch follows below).
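The narrowing from 25% to 60% to 90% can be pictured as successive re-weighting of a uniform prior. The multiplicative update below is an illustrative choice only; the disclosed embodiment does not fix a particular scoring rule.

```python
def attribute_voice(people, mouth_open, db_similarity):
    """Sharpen a uniform prior over on-screen people with two cues."""
    scores = {p: 1.0 for p in people}                  # uniform prior: 4 people -> 25% each
    for p in people:
        if mouth_open[p]:
            scores[p] *= 3.0                           # favor people whose mouths are open
        scores[p] *= db_similarity.get(p, 0.1)         # similarity to stored voices, in (0, 1]
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}

probs = attribute_voice(
    ["person_902", "person_904", "person_906", "person_908"],
    {"person_902": True, "person_904": False, "person_906": True, "person_908": False},
    {"person_906": 0.9, "person_902": 0.2, "person_904": 0.1, "person_908": 0.1},
)
print(max(probs, key=probs.get))  # -> person_906, the object in this example
```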
  • the electronic device may store the extracted voice as the voice of the object.
  • the electronic device may store voice information related to the voice of the object together with the voice of the object.
  • FIG. 10 is a block diagram illustrating a configuration of an electronic device according to an exemplary embodiment.
  • the electronic device 1000 may include a communication unit 1020, a processor 1040, and a memory 1060. However, not all of the components shown in FIG. 10 are essential components of the electronic device 1000.
  • the electronic device 1000 may be implemented by more components than the components illustrated in FIG. 10, or the electronic device 1000 may be implemented by fewer components than the components illustrated in FIG. 10.
  • the communication unit 1020, the processor 1040, and the memory 1060 may be implemented in the form of a single chip.
  • the communication unit 1020 may communicate with another electronic device connected to the electronic device 1000 by wire or wirelessly.
  • the communication unit 1020 may transmit an image being played on the electronic device 1000 to an external device or server, receive an image from an external device or server, and transmit/receive data to and from the external device or server.
  • the communication unit 1020 may transmit an image or video to an externally connected display device for display, using an output port for outputting video/audio signals or using wireless communication.
  • such output ports may include HDMI, DP, Thunderbolt, and the like, which transmit video and audio signals simultaneously, or separate ports that output the video and audio signals individually.
  • the processor 1040 controls the overall operation of the electronic device 1000 and may include at least one processor such as a CPU or a GPU.
  • the processor 1040 may control other components included in the electronic device 1000 to perform an operation for operating the electronic device 1000.
  • the processor 1040 may execute a program stored in the memory 1060, read a stored file, or store a new file.
  • the processor 1040 may perform an operation for operating the electronic device 1000 by executing a program stored in the memory 1060.
  • the processor 1040 may obtain information on an object whose voice is to be stored; extract the voice of the object by detecting the speech state of the object from an image including the object, based on the obtained information on the object; store, as a result of modeling the extracted voice, the voice of the object and voice information related to the voice of the object; and provide a voice service based on the stored voice information (an end-to-end sketch of this flow follows below).
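Read end to end, the processor's flow can be summarized by the sketch below. Every method name on the hypothetical device object is an assumption standing in for the operations described in the preceding figures.

```python
def provide_voice_service(command: str, device):
    """High-level sketch of the stored-voice pipeline; all device methods are hypothetical."""
    target = device.parse_object_info(command)             # whose voice to store
    video = device.acquire_video(target)                   # local, external, or server-provided video
    frames = device.sample_frames(video)                   # N captures, M of which show the object
    intervals = device.detect_utterance_intervals(frames)  # lips, mouth shape, captions, script
    voice = device.extract_voice(video, intervals)         # audio cut from the utterance sections
    model, info = device.model_voice(voice)                # voice model + related voice information
    device.store(target, model, info)                      # persisted for later TTS generation
    return device.tts_service(model, info)                 # e.g., read messages in the stored voice
```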
  • various types of data, such as programs (e.g., applications) and files, may be installed and stored in the memory 1060.
  • the processor 1040 may access and use data stored in the memory 1060 or may store new data in the memory 1060.
  • the memory 1060 may include a database.
  • the memory 1060 may store voice information related to the voice of the object and the voice of the object.
  • the electronic device 1000 may further include a sensor unit, a display, an antenna, a sensing unit, an input/output unit, a video processing unit, an audio processing unit, an audio output unit, a user input unit, and the like.
  • FIG. 11 is a block diagram illustrating a detailed configuration of an electronic device according to an exemplary embodiment.
  • the electronic device includes a processor 1100, a user input unit 1110, a display 1120, a video processing unit 1130, an audio processing unit 1140, an audio output unit 1150, a sensing unit 1160, and A communication unit 1170, an input/output unit 1180, and a memory 1190 may be included.
  • however, the electronic device may be implemented with more components than those illustrated in FIG. 11, or with fewer components than those illustrated in FIG. 11.
  • the processor 1100, the user input unit 1110, the display 1120, the video processing unit 1130, the audio processing unit 1140, the audio output unit 1150, the sensing unit 1160, the communication unit 1170, the input/output unit 1180, and the memory 1190 may be implemented in the form of a single chip.
  • the processor 1100, the communication unit 1170, and the memory 1190 may correspond to the processor 1040, the communication unit 1020, and the memory 1060 described above with reference to FIG. 10. Accordingly, in describing the electronic device, a description overlapping with that in FIG. 10 will be omitted.
  • the user input unit 1110 refers to a means for a user to input data for controlling an electronic device.
  • the user input unit 1110 may include a key pad, a dome switch, a touch pad, a jog wheel, a jog switch, and the like, but is not limited thereto.
  • the display 1120 may display an image on a screen under the control of the processor 1100.
  • the image or image displayed on the screen may be received from the communication unit 1170, the input/output unit 1180, and the memory 1190.
  • the video processing unit 1130 processes image data to be displayed by the display 1120 and may perform various image processing operations on the image data, such as decoding, rendering, scaling, noise filtering, frame rate conversion, and resolution conversion.
  • the audio processing unit 1140 processes audio data.
  • the audio processing unit 1140 may perform various processing such as decoding, amplification, noise filtering, or the like for audio data.
  • under the control of the processor 1100, the audio output unit 1150 may output audio included in a broadcast signal received through a tuner unit, audio input through the communication unit 1170 or the input/output unit 1180, or audio stored in the memory 1190.
  • the audio output unit 1150 may include at least one of a speaker 1152, a headphone output terminal 1154, or a Sony/Philips Digital Interface (S/PDIF) output terminal 1156.
  • the sensing unit 1160 detects a user's voice, a user's image, or a user's interaction, and may include a microphone 1162, a camera 1164, and a light receiving unit 1166.
  • the sensing unit 1160 may detect the user's voice in order to generate the user's text-to-speech (TTS) voice, or may receive a voice command from the user to obtain information on an object whose voice is to be stored in order to generate the TTS voice of that object.
  • the microphone 1162 receives the user's uttered voice.
  • the microphone 1162 may convert the received voice into an electrical signal and output it to the processor 1100.
  • the microphone 1162 may receive the user's spoken voice and convert it into an electrical signal so that the processor can generate the user's TTS voice.
  • the camera 1164 may receive an image (e.g., consecutive frames) corresponding to a user's motion, including a gesture, within the camera recognition range. Further, the light receiving unit 1166 receives an optical signal (including a control signal) from a remote control device.
  • the light receiving unit 1166 may receive an optical signal corresponding to a user input (e.g., a touch, a press, a touch gesture, a voice, or a motion) from the remote control device.
  • a control signal may be extracted from the received optical signal under the control of the processor 1100.
  • the communication unit 1170 may include one or more modules that enable wireless communication between an electronic device and a wireless communication system or between an electronic device and a network in which another electronic device is located.
  • the communication unit 1170 may include a short-range communication unit 1172, a mobile communication unit 1174, a broadcast receiving unit 1176, a wireless Internet module (not shown), and the like.
  • the communication unit 1170 may be referred to as a transmission/reception unit.
  • the communication unit 1170 may connect to another external device or transmit/receive video/audio data using a wireless Internet module or a short-range communication unit 1172.
  • the short-range communication unit 1172 may mean a module for short-range communication.
  • for short-range communication, Bluetooth, Radio Frequency Identification (RFID), Infrared Data Association (IrDA), Ultra Wideband (UWB), ZigBee, and the like may be used.
  • the mobile communication unit 1174 may transmit and receive a wireless signal with at least one of a base station, an external terminal, and a server on a mobile communication network.
  • the wireless signal may include a voice call signal, a video call signal, or various types of data according to transmission and reception of text/multimedia messages.
  • the broadcast receiving unit 1176 may receive a broadcast signal and/or broadcast-related information from an external broadcast management server through a broadcast channel.
  • the broadcast signal may include a TV broadcast signal, a radio broadcast signal, and a data broadcast signal, as well as a broadcast signal in which a data broadcast signal is combined with a TV broadcast signal or a radio broadcast signal.
  • the wireless Internet module refers to a module for wireless Internet access, and may be built-in or external to the device.
  • as wireless Internet technologies, Wireless LAN (WLAN), Wireless Broadband (WiBro), Worldwide Interoperability for Microwave Access (WiMAX), High Speed Downlink Packet Access (HSDPA), and the like may be used.
  • the electronic device can establish a Wi-Fi peer-to-peer (P2P) connection with other electronic devices.
  • a streaming service between devices can be provided through such a Wi-Fi P2P connection, and a printing service can be provided by transmitting/receiving data or connecting to a printer.
  • under the control of the processor 1100, the input/output unit 1180 may transmit and receive video (e.g., moving images), audio (e.g., voice, music, etc.), and additional information (e.g., EPG, etc.) to and from the outside of the electronic device.
  • the input/output unit 1180 may include at least one of a High-Definition Multimedia Interface (HDMI) port 1182, a component jack 1184, a PC port 1186, and a USB port 1188, or any combination thereof.
  • various types of data, such as programs (e.g., applications) and files, may be installed and stored in the memory 1190.
  • the processor 1100 may access and use data stored in the memory 1190 or may store new data in the memory 1190.
  • the memory 1190 may include a database.
  • the memory 1190 may store voice information related to the voice of the object and the voice of the object.
  • Computer-readable media can be any available media that can be accessed by a computer, and includes both volatile and nonvolatile media, removable and non-removable media. Further, the computer-readable medium may include a computer storage medium. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • the disclosed embodiments may be implemented as an S/W program including instructions stored in a computer-readable storage medium.
  • the computer, as a device capable of calling a stored instruction from the storage medium and operating according to the disclosed embodiments in accordance with the called instruction, may include an electronic device according to the disclosed embodiments.
  • the computer-readable storage medium may be provided in the form of a non-transitory storage medium.
  • "non-transitory" means that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored semi-permanently or temporarily in the storage medium.
  • control method according to the disclosed embodiments may be provided by being included in a computer program product.
  • Computer program products can be traded between sellers and buyers as commodities.
  • the computer program product may include a S/W program and a computer-readable storage medium in which the S/W program is stored.
  • the computer program product may include a product (eg, a downloadable app) in the form of a S/W program that is electronically distributed through a device manufacturer or an electronic market (eg, Google Play Store, App Store).
  • the storage medium may be a server of a manufacturer, a server of an electronic market, or a storage medium of a relay server that temporarily stores the S/W program.
  • the computer program product may include a storage medium of a server or a storage medium of a device in a system composed of a server and a device.
  • when there is a third device (e.g., a smartphone) communicatively connected to the server and the device, the computer program product may include a storage medium of the third device.
  • the computer program product may include a S/W program itself transmitted from a server to a device or a third device, or transmitted from a third device to a device.
  • one of the server, the device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments.
  • two or more of a server, a device, and a third device may execute a computer program product to distribute and implement the method according to the disclosed embodiments.
  • for example, a server (e.g., a cloud server or an artificial intelligence server) may execute the computer program product and control a device communicatively connected to the server to perform the method according to the disclosed embodiments.
  • as another example, the third device may execute the computer program product and control a device communicatively connected to the third device to perform the method according to the disclosed embodiment.
  • the third device may download the computer program product from the server and execute the downloaded computer program product.
  • the third device may perform the method according to the disclosed embodiments by executing the computer program product provided in a preloaded state.
  • the "unit” may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware configuration such as a processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention relates to a method of providing a voice service and an application thereof. According to an embodiment of the present invention, a method of operating an electronic device comprises the steps of: acquiring information on an object whose voice is to be stored; extracting, based on the acquired information on the object, the voice of the object by detecting the speech state of the object from an image that includes the object; storing, as a result of modeling the extracted voice, the voice of the object and voice information related to the voice of the object; and providing a voice service based on the stored voice information.
PCT/KR2020/012559 2019-10-25 2020-09-17 Method and device for providing voice service WO2021080190A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020190134100A KR20210049601A (ko) 2019-10-25 2019-10-25 Method and apparatus for providing voice service
KR10-2019-0134100 2019-10-25

Publications (1)

Publication Number Publication Date
WO2021080190A1 true WO2021080190A1 (fr) 2021-04-29

Family

ID=75619879

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/012559 WO2021080190A1 (fr) 2019-10-25 2020-09-17 Method and device for providing voice service

Country Status (2)

Country Link
KR (1) KR20210049601A (fr)
WO (1) WO2021080190A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020068235A (ko) * 2001-02-20 2002-08-27 유재천 Apparatus and method for speech recognition using images of teeth and lips
JP2004333738A (ja) * 2003-05-06 2004-11-25 Nec Corp Speech recognition apparatus and method using video information
JP2005006181A (ja) * 2003-06-13 2005-01-06 Nippon Telegr & Teleph Corp <Ntt> Method and apparatus for aligning video/audio with scenario text, and storage medium and computer software recording the method
KR20160131505A (ko) * 2015-05-07 2016-11-16 주식회사 셀바스에이아이 Voice conversion method and voice conversion apparatus
KR20190059046A (ko) * 2017-11-22 2019-05-30 (주)알앤디테크 Voice recognition system

Also Published As

Publication number Publication date
KR20210049601A (ko) 2021-05-06

Similar Documents

Publication Publication Date Title
WO2018034552A1 Language translation device and language translation method
WO2015111845A1 Electronic device and voice recognition method thereof
US11487502B2 Portable terminal device and information processing system
WO2018174437A1 Electronic device and control method therefor
WO2016028042A1 Method for providing a visual image of a sound and electronic device implementing the method
WO2019124742A1 Method for processing voice signals of multiple speakers, and electronic device therefor
WO2020122677A1 Method for executing a function of an electronic device, and electronic device using same
WO2020162709A1 Electronic device for providing graphic data based on voice and operating method thereof
WO2020159288A1 Electronic device and control method therefor
WO2016036143A1 Method for processing multimedia data of an electronic device, and electronic device therefor
WO2019112342A1 Voice recognition apparatus and operating method thereof
WO2018182201A1 Method and device for providing a response to a user's voice input
WO2020045835A1 Electronic device and control method therefor
WO2019112181A1 Electronic device for executing an application using phoneme information included in audio data, and operating method thereof
WO2015199430A1 Method and apparatus for managing data
WO2019107719A1 Device and method for visually displaying a speaker's voice in a 360-degree video
WO2021172832A1 Method for editing an image on the basis of gesture recognition, and electronic device supporting same
WO2020101174A1 Method and apparatus for producing a personalized lip-reading model
WO2021080190A1 Method and device for providing voice service
WO2022169039A1 Electronic apparatus and control method therefor
WO2021107308A1 Electronic device and control method therefor
WO2021256760A1 Mobile electronic device and control method therefor
WO2020138943A1 Voice recognition apparatus and method
WO2020204357A1 Electronic device and control method therefor
WO2020075998A1 Electronic device and control method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20879749

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20879749

Country of ref document: EP

Kind code of ref document: A1