WO2024085290A1 - Artificial intelligence device and control method therefor - Google Patents

Artificial intelligence device and control method therefor

Info

Publication number
WO2024085290A1
Authority
WO
WIPO (PCT)
Prior art keywords
artificial intelligence
data
image data
voice
intelligence device
Prior art date
Application number
PCT/KR2022/016193
Other languages
English (en)
Korean (ko)
Inventor
김성진
허진영
전영혁
김중락
허정
이재훈
Original Assignee
엘지전자 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 엘지전자 주식회사
Priority to PCT/KR2022/016193
Publication of WO2024085290A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/103 - Formatting, i.e. changing of presentation of documents
    • G06F40/106 - Display of layout of documents; Previewing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 - 2D [Two Dimensional] image generation
    • G06T11/60 - Editing figures and text; Combining figures or text
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/80 - 2D [Two Dimensional] animation, e.g. using sprites
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems

Definitions

  • This disclosure relates to an artificial intelligence device that provides a phototoon service for a predetermined unit of data in video data and a method of operating the same.
  • The purpose of this disclosure is to provide an artificial intelligence device that provides a phototoon service based on voice recognition technology, and a method of operating the same.
  • A method of operating an artificial intelligence device includes: detecting an event; extracting at least one piece of image data constituting video data according to the event; extracting voice data corresponding to the image data and performing STT processing on it; combining the STT-processed data and the image data into one image; and outputting the composite image.
  • The event may include receiving a phototoon service request signal.
  • The at least one piece of image data may be data corresponding to any one of a frame, a scene, and a sequence unit that is a set of a plurality of scenes.
  • The at least one piece of image data may be determined based on an object in the video data.
  • The method may include: detecting a face from the at least one piece of image data; if the size of the detected face exceeds a threshold, recognizing the direction of the face; recognizing the position of the mouth of the face; determining a position of a speech bubble that will contain the STT-processed data according to the recognized face direction and mouth position; and compositing the image data so that the speech bubble containing the STT-processed data is placed at the determined position.
  • When a face is not detected from the at least one piece of image data, the method may further include determining a position at which the speech bubble is to be output in one area of the screen and compositing it with the image data.
  • The at least one piece of image data may correspond to a scene change section or a sound output section in the video.
  • When there are a plurality of composite images, they may be grouped and summarized according to a predefined criterion so that only some of the composite images are output.
  • An artificial intelligence device includes a display that outputs video data and a processor that controls the display, wherein the processor detects an event, extracts at least one piece of image data constituting the video data according to the event, extracts voice data corresponding to the image data, performs STT processing on it, and combines the STT-processed data and the image data into one image to output a composite image.
  • The event includes receiving a phototoon service request signal, and the at least one piece of image data may be data corresponding to any one of a frame, a scene, and a sequence unit that is a set of a plurality of scenes.
  • The processor may determine the at least one piece of image data based on an object in the video data.
  • The processor may detect a face from the at least one piece of image data and, when the size of the detected face exceeds a threshold, recognize the direction of the face and the position of the mouth, determine the position of the speech bubble that will contain the STT-processed data according to the recognized face direction and mouth position, and composite the image data so that the speech bubble containing the STT-processed data is placed at the determined position.
  • When a face is not detected from the at least one piece of image data, the processor may determine a position at which the speech bubble is to be output in one area of the screen and composite it with the image data.
  • The at least one piece of image data may correspond to a scene change section or a sound output section in the video.
  • When there are a plurality of composite images for the video data, the processor may group and summarize the plurality of composite images according to predefined criteria so that only some of the composite images are output.
  • FIG. 1 is a diagram for explaining a voice system according to an embodiment of the present invention.
  • Figure 2 is a block diagram for explaining the configuration of an artificial intelligence device according to an embodiment of the present disclosure.
  • Figure 3 is a block diagram for explaining the configuration of a voice service server according to an embodiment of the present invention.
  • Figure 4 is a diagram illustrating an example of converting a voice signal into a power spectrum according to an embodiment of the present invention.
  • Figure 5 is a block diagram illustrating the configuration of a processor for voice recognition and synthesis of an artificial intelligence device, according to an embodiment of the present invention.
  • Figure 6 is a block diagram of a voice service system for providing a voice recognition-based phototoon service according to an embodiment of the present disclosure.
  • Figure 7 is a block diagram of the processor of Figure 6.
  • FIGS. 8 to 11 are flowcharts illustrating a method of providing a phototoon service according to the present disclosure.
  • Figures 12 to 14 are diagrams to explain a method of providing a phototoon service according to an embodiment of the present disclosure.
  • FIG. 15 is a diagram illustrating a method of providing a phototoon service using voice recognition technology according to an embodiment of the present disclosure.
  • Figures 16a and 16b are diagrams to explain a method of providing a phototoon service using voice recognition technology according to an embodiment of the present disclosure.
  • 'Artificial intelligence devices' described in this specification include mobile phones, smartphones, laptop computers, artificial intelligence devices for digital broadcasting, personal digital assistants (PDAs), portable multimedia players (PMPs), navigation devices, slate PCs, tablet PCs, ultrabooks, and wearable devices (e.g., watch-type artificial intelligence devices (smartwatches), glass-type artificial intelligence devices (smart glasses), and head mounted displays (HMDs)).
  • The configuration according to the present disclosure may also be applied to fixed artificial intelligence devices such as smart TVs, desktop computers, digital signage, refrigerators, washing machines, air conditioners, and dishwashers.
  • the artificial intelligence device 10 can also be applied to a fixed or movable robot.
  • the artificial intelligence device 10 can perform the function of a voice agent.
  • a voice agent may be a program that recognizes the user's voice and outputs a response appropriate for the recognized user's voice as a voice.
  • FIG. 1 is a diagram for explaining a voice service system according to an embodiment of the present invention.
  • the voice service may include at least one of voice recognition and voice synthesis services.
  • The speech recognition and synthesis process converts the speaker's (or user's) voice data into text data, analyzes the speaker's intention based on the converted text data, converts text data corresponding to the analyzed intention into synthesized voice data, and outputs the converted synthesized voice data.
  • For the speech recognition and synthesis process, a voice service system as shown in Figure 1 can be used.
  • The voice service system may include an artificial intelligence device 10, a speech-to-text (STT) server 20, a natural language processing (NLP) server 30, and a voice synthesis server 40.
  • a plurality of AI agent servers 50-1 to 50-3 communicate with the NLP server 30 and may be included in the voice service system.
  • the STT server 20, NLP server 30, and voice synthesis server 40 may exist as separate servers as shown, or may be included in one server.
  • a plurality of AI agent servers 50-1 to 50-3 may also exist as separate servers or may be included in the NLP server 30.
  • the artificial intelligence device 10 may transmit a voice signal corresponding to the speaker's voice received through the microphone 122 to the STT server 20.
  • the STT server 20 can convert voice data received from the artificial intelligence device 10 into text data.
  • the STT server 20 can increase the accuracy of voice-to-text conversion by using a language model.
  • a language model can refer to a model that can calculate the probability of a sentence or the probability of the next word appearing given the previous words.
  • the language model may include probabilistic language models such as Unigram model, Bigram model, N-gram model, etc.
  • the unigram model is a model that assumes that the usage of all words is completely independent of each other, and calculates the probability of a word string as the product of the probability of each word.
  • the bigram model is a model that assumes that the use of a word depends only on the previous word.
  • the N-gram model is a model that assumes that the usage of a word depends on the previous (n-1) words.
  • the STT server 20 can use the language model to determine whether text data converted from voice data has been appropriately converted, and through this, the accuracy of conversion to text data can be increased.
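  • As an illustration of how such a probabilistic language model can score candidate transcriptions, the following is a minimal sketch (not part of the disclosure) of a bigram model with add-one smoothing; the toy corpus, function names, and smoothing choice are illustrative assumptions.

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_probability(sent, unigrams, bigrams, vocab_size):
    """P(sentence) as a product of P(w_i | w_{i-1}) with add-one smoothing."""
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob

# Example: pick the more plausible of two STT candidate transcriptions.
corpus = [["turn", "on", "the", "tv"], ["turn", "off", "the", "light"]]
uni, bi = train_bigram(corpus)
vocab = len(uni)
c1 = sentence_probability(["turn", "on", "the", "tv"], uni, bi, vocab)
c2 = sentence_probability(["turn", "on", "the", "two"], uni, bi, vocab)
print(c1 > c2)  # the in-corpus phrasing scores higher
```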
  • the NLP server 30 may receive text data from the STT server 20.
  • the STT server 20 may be included in the NLP server 30.
  • the NLP server 30 may perform intent analysis on text data based on the received text data.
  • the NLP server 30 may transmit intention analysis information indicating the result of intention analysis to the artificial intelligence device 10.
  • the NLP server 30 may transmit intention analysis information to the voice synthesis server 40.
  • the voice synthesis server 40 may generate a synthesized voice based on intent analysis information and transmit the generated synthesized voice to the artificial intelligence device 10.
  • the NLP server 30 may generate intention analysis information by sequentially performing a morpheme analysis step, a syntax analysis step, a dialogue act analysis step, and a dialogue processing step on text data.
  • the morpheme analysis step is a step that classifies text data corresponding to the voice uttered by the user into morpheme units, which are the smallest units with meaning, and determines what part of speech each classified morpheme has.
  • the syntax analysis step is a step that uses the results of the morpheme analysis step to classify text data into noun phrases, verb phrases, adjective phrases, etc., and determines what kind of relationship exists between each classified phrase.
  • the subject, object, and modifiers of the voice uttered by the user can be determined.
  • the speech act analysis step is a step of analyzing the intention of the voice uttered by the user using the results of the syntax analysis step. Specifically, the speech act analysis step is to determine the intent of the sentence, such as whether the user is asking a question, making a request, or simply expressing an emotion.
  • the conversation processing step is a step that uses the results of the dialogue act analysis step to determine whether to reply to the user's utterance, respond to it, or ask a question for additional information.
  • the NLP server 30 may generate intention analysis information including one or more of a response to the intention uttered by the user, a response, and an inquiry for additional information.
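  • To make the sequential structure of the morpheme analysis, syntax analysis, dialogue act analysis, and dialogue processing steps concrete, a minimal pipeline sketch follows; the stage implementations are placeholder stubs rather than the server's actual analyzers, and all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class IntentAnalysis:
    text: str
    morphemes: list = field(default_factory=list)  # (morpheme, part-of-speech) pairs
    phrases: list = field(default_factory=list)    # noun/verb/adjective phrases
    dialogue_act: str = ""                         # question / request / emotion ...
    response: str = ""                             # answer, response, or inquiry

def analyze_morphemes(r):
    r.morphemes = [(tok, "UNKNOWN") for tok in r.text.split()]  # stub morphological analysis
    return r

def analyze_syntax(r):
    r.phrases = [" ".join(m for m, _ in r.morphemes)]           # stub phrase grouping
    return r

def analyze_dialogue_act(r):
    r.dialogue_act = "question" if r.text.endswith("?") else "request"  # stub intent of sentence
    return r

def process_dialogue(r):
    r.response = "answer" if r.dialogue_act == "question" else "acknowledge"  # stub decision
    return r

def analyze_intent(text):
    result = IntentAnalysis(text=text)
    for stage in (analyze_morphemes, analyze_syntax, analyze_dialogue_act, process_dialogue):
        result = stage(result)
    return result

print(analyze_intent("what is the weather today?").dialogue_act)  # question
```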
  • the NLP server 30 may transmit a search request to a search server (not shown) and receive search information corresponding to the search request in order to search for information that matches the user's utterance intention.
  • the search information may include information about the searched content.
  • the NLP server 30 transmits search information to the artificial intelligence device 10, and the artificial intelligence device 10 can output the search information.
  • The NLP server 30 may receive text data from the artificial intelligence device 10. For example, if the artificial intelligence device 10 supports a voice-to-text conversion function, the artificial intelligence device 10 may convert voice data into text data and transmit the converted text data to the NLP server 30.
  • the voice synthesis server 40 can generate a synthesized voice by combining pre-stored voice data.
  • the voice synthesis server 40 can record the voice of a person selected as a model and divide the recorded voice into syllables or words.
  • the voice synthesis server 40 can store the segmented voice in units of syllables or words in an internal or external database.
  • the voice synthesis server 40 may search for syllables or words corresponding to given text data from a database, synthesize a combination of the searched syllables or words, and generate a synthesized voice.
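  • A simplified sketch of the concatenative approach just described, in which recorded units are looked up for the given text and joined together; the unit database, silence padding, and sampling rate below are illustrative assumptions, not the server's actual data.

```python
import numpy as np

# Hypothetical database mapping recorded syllables/words to waveforms (numpy arrays).
unit_db = {
    "hello": np.zeros(8000),  # placeholder waveform, 0.5 s at 16 kHz
    "world": np.zeros(8000),
}

def synthesize(text, db, sr=16000, gap_ms=50):
    """Look up each unit in the database and concatenate, inserting short silences."""
    gap = np.zeros(int(sr * gap_ms / 1000))
    pieces = []
    for unit in text.lower().split():
        if unit in db:                 # searched syllable/word found in the database
            pieces.extend([db[unit], gap])
        # Units missing from the database would need a fallback (e.g., smaller units).
    return np.concatenate(pieces) if pieces else np.zeros(0)

wave = synthesize("Hello world", unit_db)
print(wave.shape)  # concatenated waveform length
```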
  • the voice synthesis server 40 may store a plurality of voice language groups corresponding to each of a plurality of languages.
  • the speech synthesis server 40 may include a first audio language group recorded in Korean and a second audio language group recorded in English.
  • the speech synthesis server 40 may translate text data in the first language into text in the second language and generate synthesized speech corresponding to the translated text in the second language using the second speech language group.
  • the voice synthesis server 40 can transmit the generated synthesized voice to the artificial intelligence device 10.
  • the voice synthesis server 40 may receive analysis information from the NLP server 30.
  • the analysis information may include information analyzing the intention of the voice uttered by the user.
  • the voice synthesis server 40 may generate a synthesized voice that reflects the user's intention based on the analysis information.
  • the functions of the STT server 20, NLP server 30, and voice synthesis server 40 described above may also be performed by the artificial intelligence device 10.
  • the artificial intelligence device 10 may include one or more processors.
  • Each of the plurality of AI agent servers 50-1 to 50-3 may transmit search information to the NLP server 30 or the artificial intelligence device 10 according to a request from the NLP server 30.
  • the NLP server 30 transmits the content search request to one or more of the plurality of AI agent servers 50-1 to 50-3, , content search results can be received from the corresponding server.
  • the NLP server 30 may transmit the received search results to the artificial intelligence device 10.
  • Figure 2 is a block diagram for explaining the configuration of an artificial intelligence device 10 according to an embodiment of the present disclosure.
  • The artificial intelligence device 10 may include a communication unit 110, an input unit 120, a learning processor 130, a sensing unit 140, an output unit 150, a memory 170, and a processor 180.
  • the communication unit 110 can transmit and receive data with external devices using wired and wireless communication technology.
  • the communication unit 110 may transmit and receive sensor information, user input, learning models, and control signals with external devices.
  • Communication technologies used by the communication unit 110 include GSM (Global System for Mobile communication), CDMA (Code Division Multiple Access), LTE (Long Term Evolution), LTE-A (LTE-Advanced), 5G, WLAN (Wireless LAN), Wi-Fi (Wireless Fidelity), Bluetooth™, RFID (Radio Frequency Identification), IrDA (Infrared Data Association), ZigBee, and NFC (Near Field Communication).
  • the input unit 120 can acquire various types of data.
  • the input unit 120 may include a camera for inputting video signals, a microphone for receiving audio signals, and a user input unit for receiving information from a user.
  • the camera or microphone may be treated as a sensor, and the signal obtained from the camera or microphone may be referred to as sensing data or sensor information.
  • the input unit 120 may acquire training data for model learning and input data to be used when obtaining an output using the learning model.
  • the input unit 120 may acquire unprocessed input data, and in this case, the processor 180 or the learning processor 130 may extract input features by preprocessing the input data.
  • The input unit 120 may include a camera 121 for inputting video signals, a microphone 122 for receiving audio signals, and a user input unit 123 for receiving information from the user.
  • Voice data or image data collected by the input unit 120 may be analyzed and processed as a user's control command.
  • The input unit 120 is for inputting image information (or signals), audio information (or signals), data, or information input from the user. To input image information, the artificial intelligence device 10 may be provided with one or more cameras 121.
  • the camera 121 processes image frames such as still images or moving images obtained by an image sensor in video call mode or shooting mode.
  • the processed image frame may be displayed on the display unit 151 or stored in the memory 170.
  • the microphone 122 processes external acoustic signals into electrical voice data.
  • Processed voice data can be used in various ways depending on the function (or application program being executed) being performed by the artificial intelligence device 10. Meanwhile, various noise removal algorithms may be applied to the microphone 122 to remove noise generated in the process of receiving an external acoustic signal.
  • the user input unit 123 is for receiving information from the user.
  • The processor 180 can control the operation of the artificial intelligence device 10 to correspond to the input information.
  • The user input unit 123 may include a mechanical input means (or a mechanical key, such as a button, dome switch, jog wheel, or jog switch located on the front/rear or side of the terminal 100) and a touch input means.
  • The touch input means may consist of a virtual key, soft key, or visual key displayed on the touch screen through software processing, or a touch key disposed on a part other than the touch screen.
  • the learning processor 130 can train a model composed of an artificial neural network using training data.
  • the learned artificial neural network may be referred to as a learning model.
  • a learning model can be used to infer a result value for new input data other than learning data, and the inferred value can be used as the basis for a decision to perform an operation.
  • the learning processor 130 may include memory integrated or implemented in the artificial intelligence device 10. Alternatively, the learning processor 130 may be implemented using the memory 170, an external memory directly coupled to the artificial intelligence device 10, or a memory maintained in an external device.
  • the sensing unit 140 may use various sensors to obtain at least one of internal information of the artificial intelligence device 10, information about the surrounding environment of the artificial intelligence device 10, and user information.
  • The sensors included in the sensing unit 140 include a proximity sensor, illuminance sensor, acceleration sensor, magnetic sensor, gyro sensor, inertial sensor, RGB sensor, IR sensor, fingerprint recognition sensor, ultrasonic sensor, light sensor, microphone, lidar, radar, etc.
  • the output unit 150 may generate output related to vision, hearing, or tactile sensation.
  • The output unit 150 may include at least one of a display unit 151, a sound output unit 152, a haptic module 153, and an optical output unit 154.
  • the display unit 151 displays (outputs) information processed by the artificial intelligence device 10.
  • the display unit 151 may display execution screen information of an application running on the artificial intelligence device 10, or UI (User Interface) and GUI (Graphic User Interface) information according to such execution screen information.
  • the display unit 151 can implement a touch screen by forming a layered structure or being integrated with the touch sensor.
  • This touch screen functions as a user input unit 123 that provides an input interface between the artificial intelligence device 10 and the user, and can simultaneously provide an output interface between the terminal 100 and the user.
  • the audio output unit 152 may output audio data received from the communication unit 110 or stored in the memory 170 in call signal reception, call mode or recording mode, voice recognition mode, broadcast reception mode, etc.
  • the sound output unit 152 may include at least one of a receiver, a speaker, and a buzzer.
  • the haptic module 153 generates various tactile effects that the user can feel.
  • a representative example of a tactile effect generated by the haptic module 153 may be vibration.
  • the optical output unit 154 uses light from the light source of the artificial intelligence device 10 to output a signal to notify the occurrence of an event. Examples of events that occur in the artificial intelligence device 10 may include receiving a message, receiving a call signal, missed call, alarm, schedule notification, receiving email, receiving information through an application, etc.
  • the memory 170 can store data supporting various functions of the artificial intelligence device 10.
  • the memory 170 may store input data, learning data, learning models, learning history, etc. obtained from the input unit 120.
  • The processor 180 may determine at least one executable operation of the artificial intelligence device 10 based on information determined or generated using a data analysis algorithm or a machine learning algorithm, and can control the components of the artificial intelligence device 10 to perform the determined operation.
  • The processor 180 may request, retrieve, receive, or utilize data from the learning processor 130 or the memory 170, and may control the components of the artificial intelligence device 10 to execute an operation that is predicted or determined to be desirable among the at least one executable operation.
  • the processor 180 may generate a control signal to control the external device and transmit the generated control signal to the external device.
  • the processor 180 may obtain intent information for user input and determine the user's request based on the obtained intent information.
  • the processor 180 may obtain intent information corresponding to the user input by using at least one of an STT engine for converting voice input into a character string or an NLP engine for obtaining intent information of natural language.
  • At least one of the STT engine and the NLP engine may be composed of at least a portion of an artificial neural network trained according to a machine learning algorithm, and may be trained by the learning processor 130, trained by the learning processor 240 of the AI server 200, or trained by distributed processing thereof.
  • The processor 180 may collect history information including the user's feedback on the operation of the artificial intelligence device 10, store it in the memory 170 or the learning processor 130, or transmit it to an external device such as the AI server 200. The collected history information can be used to update the learning model.
  • the processor 180 may control at least some of the components of the artificial intelligence device 10 to run an application program stored in the memory 170. Furthermore, the processor 180 may operate two or more of the components included in the artificial intelligence device 10 in combination with each other in order to run the application program.
  • Figure 3 is a block diagram for explaining the configuration of the voice service server 200 according to an embodiment of the present invention.
  • the voice service server 200 may include one or more of the STT server 20, NLP server 30, and voice synthesis server 40 shown in FIG. 1.
  • the voice service server 200 may be referred to as a server system.
  • the voice service server 200 may include a preprocessor 220, a controller 230, a communication unit 270, and a database 290.
  • the preprocessing unit 220 may preprocess the voice received through the communication unit 270 or the voice stored in the database 290.
  • the preprocessing unit 220 may be implemented as a separate chip from the controller 230 or may be implemented as a chip included in the controller 230.
  • the preprocessor 220 may receive a voice signal (uttered by a user) and filter noise signals from the voice signal before converting the received voice signal into text data.
  • If the preprocessor 220 is provided in the artificial intelligence device 10, it can recognize a startup word for activating voice recognition of the artificial intelligence device 10.
  • The preprocessor 220 converts the startup word received through the microphone 122 into text data, and if the converted text data corresponds to a pre-stored startup word, it may be determined that the startup word has been recognized.
  • the preprocessor 220 may convert the noise-removed voice signal into a power spectrum.
  • The power spectrum may be a parameter that indicates which frequency components are included in the temporally varying waveform of a voice signal and at what magnitude.
  • the power spectrum shows the distribution of squared amplitude values according to the frequency of the waveform of the voice signal.
  • Figure 4 is a diagram illustrating an example of converting a voice signal into a power spectrum according to an embodiment of the present invention.
  • the voice signal 410 may be received from an external device or may be a signal previously stored in the memory 170.
  • The x-axis of the voice signal 410 may represent time, and the y-axis may represent amplitude.
  • the power spectrum processor 225 may convert the voice signal 410, where the x-axis is the time axis, into a power spectrum 430, where the x-axis is the frequency axis.
  • the power spectrum processor 225 may convert the voice signal 410 into a power spectrum 430 using Fast Fourier Transform (FFT).
  • FFT Fast Fourier Transform
  • the x-axis of the power spectrum 430 represents frequency, and the y-axis represents the square value of amplitude.
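  • As a worked illustration of the conversion described for Figure 4, the following sketch uses NumPy's FFT to turn a time-domain signal into a power spectrum (squared amplitude versus frequency); the synthetic signal and sampling rate are made up for the example and are not the disclosed data.

```python
import numpy as np

sr = 16000                                   # sampling rate (Hz)
t = np.arange(0, 1.0, 1 / sr)                # 1 second of samples (x-axis: time)
voice = 0.7 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)

spectrum = np.fft.rfft(voice)                # FFT of the time-domain waveform
freqs = np.fft.rfftfreq(len(voice), 1 / sr)  # x-axis of the power spectrum: frequency
power = np.abs(spectrum) ** 2                # y-axis: squared amplitude

peak = freqs[np.argmax(power)]
print(round(peak))  # ~220, the dominant frequency component of the synthetic signal
```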
  • the functions of the preprocessor 220 and the controller 230 described in FIG. 3 can also be performed by the NLP server 30.
  • the pre-processing unit 220 may include a wave processing unit 221, a frequency processing unit 223, a power spectrum processing unit 225, and an STT converting unit 227.
  • the wave processing unit 221 can extract the waveform of the voice.
  • the frequency processing unit 223 can extract the frequency band of the voice.
  • the power spectrum processing unit 225 can extract the power spectrum of the voice.
  • The power spectrum may be a parameter that indicates which frequency components are included in the waveform and at what magnitude.
  • the STT converter 227 can convert voice into text.
  • the STT conversion unit 227 can convert voice in a specific language into text in that language.
  • the controller 230 can control the overall operation of the voice service server 200.
  • the controller 230 may include a voice analysis unit 231, a text analysis unit 232, a feature clustering unit 233, a text mapping unit 234, and a voice synthesis unit 235.
  • the voice analysis unit 231 may extract voice characteristic information using one or more of the voice waveform, voice frequency band, and voice power spectrum preprocessed in the preprocessor 220.
  • the voice characteristic information may include one or more of the speaker's gender information, the speaker's voice (or tone), the pitch of the sound, the speaker's speaking style, the speaker's speech speed, and the speaker's emotion.
  • the voice characteristic information may further include the speaker's timbre.
  • the text analysis unit 232 may extract key expressions from the text converted by the speech-to-text conversion unit 227.
  • When the text analysis unit 232 detects a change in tone between phrases in the converted text, it can extract the phrase with a different tone as the main expression phrase.
  • the text analysis unit 232 may determine that the tone has changed when the frequency band between the phrases changes more than a preset band.
  • the text analysis unit 232 may extract key words from phrases in the converted text.
  • a key word may be a noun that exists within a phrase, but this is only an example.
  • the feature clustering unit 233 can classify the speaker's speech type using the voice characteristic information extracted from the voice analysis unit 231.
  • the feature clustering unit 233 may classify the speaker's utterance type by assigning a weight to each type item constituting the voice characteristic information.
  • the feature clustering unit 233 can classify the speaker's utterance type using the attention technique of a deep learning model.
  • the text mapping unit 234 may translate the text converted into the first language into the text of the second language.
  • the text mapping unit 234 may map the text translated into the second language with the text of the first language.
  • the text mapping unit 234 can map key expressions constituting the text in the first language to corresponding phrases in the second language.
  • the text mapping unit 234 may map the utterance type corresponding to the main expression phrases constituting the text of the first language to phrases of the second language. This is to apply the classified utterance type to the phrases of the second language.
  • The voice synthesis unit 235 can create a synthesized voice by applying the utterance type and speaker's tone classified by the feature clustering unit 233 to the main expressions of the text translated into the second language by the text mapping unit 234.
  • the controller 230 may determine the user's speech characteristics using one or more of the delivered text data or the power spectrum 430.
  • the user's speech characteristics may include the user's gender, the user's pitch, the user's tone, the user's speech topic, the user's speech speed, and the user's voice volume.
  • the controller 230 may use the power spectrum 430 to obtain the frequency of the voice signal 410 and the amplitude corresponding to the frequency.
  • the controller 230 can determine the gender of the user who uttered the voice using the frequency band of the power spectrum 430.
  • For example, if the frequency band of the power spectrum 430 is within a preset first frequency band range, the controller 230 may determine the user's gender as male.
  • If the frequency band of the power spectrum 430 is within a preset second frequency band range, the controller 230 may determine the user's gender as female.
  • the second frequency band range may be larger than the first frequency band range.
  • the controller 230 can determine the pitch of the voice using the frequency band of the power spectrum 430.
  • the controller 230 may determine the pitch of the sound according to the size of the amplitude within a specific frequency band.
  • the controller 230 may determine the user's tone using the frequency band of the power spectrum 430. For example, the controller 230 may determine a frequency band with an amplitude greater than a certain level among the frequency bands of the power spectrum 430 as the user's main sound range, and determine the determined main sound range as the user's tone.
  • the controller 230 may determine the user's speech rate based on the number of syllables uttered per unit time from the converted text data.
  • the controller 230 can determine the topic of the user's speech using the Bag-Of-Word Model technique for the converted text data.
  • the Bag-Of-Word Model technique is a technique to extract frequently used words based on the frequency of words in a sentence.
  • the Bag-Of-Word Model technique is a technique that extracts unique words within a sentence and expresses the frequency of each extracted word as a vector to determine the characteristics of the topic of speech.
  • the topic of the user's speech may be classified as exercise.
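  • A minimal sketch of the Bag-of-Words idea described above: unique words are extracted, their frequencies form a vector, and a topic is chosen by comparing against keyword sets; the topic dictionaries and example sentence are hypothetical and not from the disclosure.

```python
from collections import Counter

topic_keywords = {                       # hypothetical topic dictionaries
    "exercise": {"run", "running", "workout", "stamina"},
    "cooking": {"recipe", "boil", "pan", "ingredients"},
}

def bag_of_words(text):
    """Frequency vector (as a Counter) over the unique words in the text."""
    return Counter(text.lower().split())

def classify_topic(text):
    bow = bag_of_words(text)
    scores = {topic: sum(bow[w] for w in words) for topic, words in topic_keywords.items()}
    return max(scores, key=scores.get)

print(classify_topic("I went running to build stamina before my workout"))  # exercise
```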
  • the controller 230 can determine the topic of the user's speech from text data using a known text categorization technique.
  • the controller 230 can extract keywords from text data and determine the topic of the user's speech.
  • the controller 230 can determine the user's voice volume by considering amplitude information in the entire frequency band.
  • The controller 230 can determine the user's voice volume based on the average or weighted average of the amplitude in each frequency band of the power spectrum.
  • the communication unit 270 may communicate with an external server by wire or wirelessly.
  • the database 290 may store the voice of the first language included in the content.
  • the database 290 may store a synthesized voice in which the voice of the first language is converted into the voice of the second language.
  • the database 290 may store a first text corresponding to a voice in the first language and a second text in which the first text is translated into the second language.
  • the database 290 may store various learning models required for voice recognition.
  • the processor 180 of the artificial intelligence device 10 shown in FIG. 2 may include the preprocessor 220 and the controller 230 shown in FIG. 3.
  • the processor 180 of the artificial intelligence device 10 may perform the functions of the preprocessor 220 and the controller 230.
  • Figure 5 is a block diagram illustrating the configuration of a processor for voice recognition and synthesis of the artificial intelligence device 10, according to an embodiment of the present invention.
  • the voice recognition and synthesis process of FIG. 5 may be performed by the learning processor 130 or processor 180 of the artificial intelligence device 10 without going through the server.
  • the processor 180 of the artificial intelligence device 10 may include an STT engine 510, an NLP engine 530, and a voice synthesis engine 550.
  • Each engine can be either hardware or software.
  • the STT engine 510 may perform the function of the STT server 20 of FIG. 1. That is, the STT engine 510 can convert voice data into text data.
  • the NLP engine 530 may perform the functions of the NLP server 30 of FIG. 1. That is, the NLP engine 530 can obtain intention analysis information indicating the speaker's intention from the converted text data.
  • the voice synthesis engine 550 may perform the function of the voice synthesis server 40 of FIG. 1.
  • the speech synthesis engine 550 may search a database for syllables or words corresponding to given text data, synthesize a combination of the searched syllables or words, and generate a synthesized voice.
  • the voice synthesis engine 550 may include a preprocessing engine 551 and a TTS engine 553.
  • the preprocessing engine 551 may preprocess text data before generating synthetic speech.
  • the preprocessing engine 551 performs tokenization by dividing text data into tokens, which are meaningful units.
  • the preprocessing engine 551 may perform a cleansing operation to remove unnecessary characters and symbols to remove noise.
  • the preprocessing engine 551 can generate the same word token by integrating word tokens with different expression methods.
  • the preprocessing engine 551 may remove meaningless word tokens (stopwords).
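  • A minimal sketch of the preprocessing steps listed above (tokenization, cleansing of unnecessary symbols, unifying word tokens with different expression methods, and stopword removal); the normalization table and stopword list are illustrative assumptions rather than the engine's actual resources.

```python
import re

normalization = {"tv": "television", "t.v.": "television"}  # unify variant word tokens
stopwords = {"a", "an", "the", "please"}                     # meaningless word tokens to drop

def preprocess(text):
    text = re.sub(r"[^\w\s.]", " ", text.lower())  # cleansing: strip unnecessary symbols
    tokens = text.split()                          # tokenization into meaningful units
    tokens = [normalization.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in stopwords]

print(preprocess("Please turn on the T.V.!"))  # ['turn', 'on', 'television']
```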
  • the TTS engine 553 can synthesize speech corresponding to preprocessed text data and generate synthesized speech.
  • Hereinafter, a method of applying voice service technology (e.g., voice recognition, voice synthesis, etc.) to video data of various lengths, based on the various platforms consumed through artificial intelligence devices, will be described.
  • 'Phototoon' described in this disclosure is a compound word of 'photo' and 'toon': an image (in still or video format) is acquired for a desired portion (e.g., all or part) of video data provided through the artificial intelligence device 10, the corresponding voice data is converted to text, and a composite image (in still or video format) combining the acquired image with the converted text is displayed.
  • the process of creating and providing a phototoon for video data in the artificial intelligence device 10 is referred to as a ‘phototoon service’.
  • the present disclosure is not limited to the above terms.
  • the artificial intelligence device 10 can provide a summary service (summary or summary data) for desired portions of target video data through the phototoon service.
  • the phototoon service may be provided in such a way that the target video data is output as is, but the phototoon composite image is output only in a specific section, that is, the phototoon service section.
  • Alternatively, the phototoon service may be provided in such a way that a phototoon composite image for a specific section is generated separately from playback of the target video, and only the phototoon composite image is output.
  • a plurality of phototoon service sections or a plurality of phototoon composite images may be generated and service provided for one target video.
  • the artificial intelligence device 10 may sense an event, for example, skip each phototoon service section according to the user's request through a remote control device, and provide a service to the user so that he or she can consume the target video.
  • The artificial intelligence device 10 may distinguish sections (areas) available for the phototoon service within the target video, and may provide each divided phototoon service section so that the user can identify and select it.
  • the artificial intelligence device 10 can list them and provide them for selection, and output the selected phototoon service data.
  • phototoon composite data can be generated in units of desired sections.
  • the ‘desired section’ may represent, for example, a frame, a scene, or a sequence unit composed of a plurality of scenes.
  • The artificial intelligence device 10 may create phototoon composite data only for some scene(s) (or main scenes), not all scenes constituting the sequence. However, it is not necessarily limited to the above.
  • Voice recognition technology may be processed by an STT engine (and NLP engine) provided in the artificial intelligence device 10, but is not necessarily limited to this.
  • voice recognition technology may be processed through the STT server 20 and NLP server 30 in the voice service server 200 and transmitted to the artificial intelligence device 10.
  • The artificial intelligence device 10 may create and provide a phototoon service menu item on its dashboard or menu so that users can easily enter and use the phototoon service, or an application dedicated to the phototoon service may be downloaded and installed for use. Alternatively, when an event such as selection or playback of a video of a preset length or longer is received, the artificial intelligence device 10 may provide an icon or an OSD (On Screen Display) message as a guide for using the phototoon service.
  • the voice service server 200 can provide a phototoon service platform and can support or guide the use of the phototoon service for target video data in the form of a web service or web app through the artificial intelligence device 10.
  • Figure 6 is a block diagram of a voice service system for providing a voice recognition-based phototoon service according to an embodiment of the present disclosure.
  • FIG. 7 is a block diagram of the processor 620 of FIG. 6.
  • a voice service system for providing a phototoon service based on a voice recognition function may be configured to include an artificial intelligence device 10.
  • the voice service server 200 may replace all or part of the functions related to the phototoon service of the artificial intelligence device 10.
  • the artificial intelligence device 10 may include an output unit 150 and a processing unit 600 that output phototoon service data and/or video data including phototoon service data.
  • the processing unit 600 may include a memory 610 and a processor 620.
  • the processor 620 controls the overall functions of the processing unit 600 and can perform operations to provide the phototoon service.
  • The processor 620 may include a data reception unit 710, a detection unit 720, a voice recognition engine 730, a synthesis unit 740, and a control unit 750 to provide the phototoon service.
  • at least one of the various components constituting the processor 620 may be implemented in the form of a plurality of modules, unlike shown.
  • the processor 620 may further include at least one component not shown in FIG. 7.
  • the data receiver 710 may receive video data, identify a phototoon service request section (or a phototoon service-capable candidate section), and process the identified phototoon service-capable candidate sections by dividing them into predetermined units.
  • the predetermined unit may be the above-mentioned frame unit, scene unit, sequence unit, etc. This distinction can be made only for the target video to which the phototoon service is applied or the phototoon service request section of the target video.
  • the detection unit 720 can detect phototoon service-related information for a predetermined unit within the target video data.
  • the information detected in this way may include at least one of scene/sequence change information, main scene information, facial feature information, face-based representative scene information, and voice information.
  • The detection unit 720 may include a preprocessing module, a learning module, etc., and can automatically detect at least one of the above-described pieces of information using an artificial intelligence model trained for the phototoon service.
  • The voice recognition engine 730 includes an STT engine and can convert voice information corresponding to image information detected through the detection unit 720 into text information. As described above, depending on the embodiment, the function of the voice recognition engine 730 may be performed by the STT server 20 in the voice service server 200, and in this case the voice recognition engine 730 in FIG. 7 can be disabled or excluded from the configuration.
  • the synthesis unit 740 can process and synthesize the image information detected by the detection unit 720 and the text information converted through the voice recognition engine 730 so that they are in sync.
  • the control unit 750 may control the overall operation and functions of the processor 620.
  • the control unit 750 can control each of the above components to provide the phototoon service according to the present disclosure to the target video.
  • the processor 620 may have the same configuration as the processor 180 of FIG. 2, but may also have a separate configuration.
  • the artificial intelligence device 10 may be replaced by or operate together with the voice service server 200 depending on the context.
  • FIGS. 8 to 11 are flowcharts illustrating a method of providing a voice recognition-based phototoon service according to an embodiment of the present disclosure.
  • Figures 12 to 14 are diagrams to explain a method of providing a phototoon service according to an embodiment of the present disclosure.
  • Figure 8 is described from the perspective of the processor 620 for convenience of explanation, but is not limited thereto.
  • the processor 620 may output video data through the output unit 150 (S101).
  • the processor 620 may detect an event (S103).
  • Events can represent various inputs, actions, etc. related to the Phototoon service.
  • the event may represent the reception of a user's phototoon service request signal through a remote control device (not shown).
  • the remote control device may include a remote control, a mobile device such as a smartphone or tablet PC installed with an application for data communication with the artificial intelligence device 10, an artificial intelligence speaker, etc.
  • This event may or may not occur while watching video data, for example in step S101.
  • For example, an event may be provided as a menu item on the home menu or may occur through voice input in any screen state (e.g., a state in which a video is not playing). In this sense, step S101 may not be essential.
  • the artificial intelligence device 10 can provide a video list and provide a phototoon service for the selected video. These video lists may also include broadcast programs.
  • the processor 620 may extract image data in a predetermined unit (S105).
  • the predetermined unit may be any one of units such as a frame, scene, or sequence.
  • For example, one predetermined unit may be a scene unit and another may be a sequence unit.
  • the predetermined unit may represent, for example, a playback section arbitrarily set by the user.
  • the predetermined unit may represent, for example, a section in which an object selected by the user is output.
  • the object may be a concept including people, objects, etc.
  • only one person may be selected, and only the scene or section in which the selected person appears may be included in the predetermined unit.
  • the predetermined unit may be determined based on a theme, attribute, etc., rather than a physical object.
  • For example, the artificial intelligence device 10 may set and provide 'cooking' as a predetermined unit, that is, a theme, and offer it for selection. Accordingly, only the sections related to cooking within the playback section of the target video can be extracted and used in the phototoon service.
  • the artificial intelligence device 10 extracts information for the phototoon service in predetermined units within the requested video playback section, but the requested video playback section does not necessarily need to be a continuous playback section.
  • the artificial intelligence device 10 may generate one phototoon service data based on a preset unit for each video. For example, if the theme of ‘cooking’ is set as a unit and multiple Phototoon service target videos are selected, a section related to cooking can be extracted from each target video to automatically generate one Phototoon service data.
  • the artificial intelligence device 10 can provide a list of currently playable videos regardless of the Phototoon service, and may also provide identification information about whether or not the Phototoon service is available for each video on the provided video list.
  • the processor 620 may extract voice data corresponding to the extracted predetermined unit of image data (S107).
  • the processor 620 may STT process the extracted corresponding voice data (S109).
  • the processor 620 can synthesize the extracted image data by aligning the converted voice data, that is, text data, so that they are in sync (S111).
  • the processor 620 may provide a phototoon service based on a synthetic image (S113).
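  • The overall flow of steps S101 to S113 could be sketched roughly as follows; the segment extraction, STT, and compositing helpers are placeholder stand-ins with hypothetical names, not the disclosed implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float   # seconds into the target video
    end: float
    frame: str     # representative image for the unit (frame / scene / sequence)
    audio: str     # corresponding voice data

def extract_segments(video_path: str) -> List[Segment]:
    # S105/S107 placeholder: real logic would do scene detection and audio extraction.
    return [Segment(0.0, 4.2, "frame_0001", "audio_0001"),
            Segment(4.2, 9.8, "frame_0002", "audio_0002")]

def stt(audio: str) -> str:
    # S109 placeholder for the STT engine / STT server call.
    return f"transcript of {audio}"

def composite(frame: str, text: str) -> str:
    # S111 placeholder: overlay a speech bubble or subtitle containing `text` on `frame`.
    return f"{frame} + bubble('{text}')"

def phototoon_service(video_path: str, event_received: bool) -> List[str]:
    if not event_received:  # S103: run only when a phototoon request event is detected
        return []
    return [composite(seg.frame, stt(seg.audio))  # S113: composite images to output
            for seg in extract_segments(video_path)]

print(phototoon_service("target.mp4", event_received=True))
```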
  • the processor 620 detects a change in a predetermined unit within the video. For example, in FIG. 9, the processor 620 can detect (or sense) whether there is a scene change (S201).
  • Scene change detection may refer to either determining whether a scene change section exists in the target video or detecting data corresponding to the scene change section.
  • the predetermined unit can be automatically set based on the scene change section.
  • the scene change may be a section corresponding to the predetermined unit of Figure 8 described above.
  • the processor 620 may detect a main scene (or important scene) for each partial clip (S203).
  • the processor 620 may detect facial features in key scenes of each detected partial clip (S205).
  • the processor 620 may detect a representative scene based on the facial features of the main scene of each partial clip detected in step S205 (S207).
  • the processor 620 may extract voice data of a section corresponding to the representative scene detected in step S207 (S209).
  • the processor 620 may process STT conversion on the voice data extracted in step S209 (S211).
  • the processor 620 may synthesize the representative scene detected in step S207 and the STT-processed data in step S211 (S213).
  • the processor 620 can configure and provide a phototoon service using the synthesized data, that is, the phototoon composite data.
  • the method of providing phototoon services follows pre-set conditions, but can be changed arbitrarily.
  • the processor 620 may detect voice in the video playback section (S301).
  • the processor 620 may extract the section where voice is detected, that is, the voice section (S303).
  • step S301 may be omitted and integrated into step S303.
  • a predetermined unit can be automatically set based on voice section extraction.
  • For example, audio may be output at a third time point 1230 and again at a fourth time point 1240; in that case, only the scenes at the times the voice is output can be extracted.
  • When the processor 620 extracts a voice section in step S303, it can perform STT conversion on the voice data of the corresponding section (S305).
  • the processor 620 may detect face data on the frame in the section where voice data is extracted in step S303 (S307).
  • the processor 620 may extract facial features from the facial data detected in step S307 (S309).
  • the processor 620 may detect a representative scene based on the facial features extracted in step S309 (S311).
  • the processor 620 may combine the STT converted data in step S305 and the representative scene detected in step S311 into one image (S313).
  • Figures 11 and 14 describe, for example, a method of compositing images when providing a phototoon service.
  • the processor 620 may determine whether the face 1410 is detected, as shown in (a) of FIG. 14 (S401).
  • the processor 620 may determine whether the face size exceeds the threshold (S403).
  • the processor 620 may recognize the face direction as shown in (b) of FIG. 14 (S405).
  • After step S405, the processor 620 can recognize the mouth position as shown in (c) of FIG. 14 (S407).
  • the processor 620 may determine the location where the STT converted text information is output, that is, the location of the speech bubble 1430, based on the face direction recognized in step S405 and the mouth position recognized in step S407 (S409).
  • The processor 620 can combine the speech balloon data and the image frame into one image so that the speech balloons 1310 and 1430 are output at the corresponding location, as shown in Figure 13 (a) and Figure 14 (c) (S411).
  • If it is determined in step S401 that no face is detected in the scene (or frame), or if the face size is less than the threshold in step S403, the processor 620 can combine the STT-converted text with the corresponding scene or frame so that it is output as subtitles in a predetermined area 1320, as shown in (b) of FIG. 13 (S413).
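  • The branching of steps S401 to S413 (speech-bubble placement when a sufficiently large face is detected, subtitle fallback otherwise) could be expressed as in the sketch below; the face representation, threshold value, and coordinate conventions are hypothetical and only illustrate the decision flow.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Face:
    size: float             # e.g., face height relative to frame height
    direction: str          # "left" or "right": the way the face is looking
    mouth: Tuple[int, int]  # (x, y) pixel position of the mouth

FACE_SIZE_THRESHOLD = 0.15
SUBTITLE_AREA = (0, 640)    # assumed bottom area of a 720p frame (S413 fallback)

def place_text(face: Optional[Face], bubble_w: int = 200) -> Tuple[str, Tuple[int, int]]:
    """Return ('bubble' | 'subtitle', position) for the STT-converted text."""
    if face is None or face.size <= FACE_SIZE_THRESHOLD:  # S401 / S403
        return "subtitle", SUBTITLE_AREA
    x, y = face.mouth                                     # S407
    # S409: put the bubble next to the mouth, on the side the face is looking toward.
    bubble_x = x + 40 if face.direction == "right" else x - 40 - bubble_w
    return "bubble", (bubble_x, y - 60)

print(place_text(Face(size=0.3, direction="right", mouth=(500, 300))))  # bubble near the mouth
print(place_text(None))                                                 # falls back to subtitles
```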
  • Figures 15, 16a, and 16b are diagrams to explain a method of providing a phototoon service using voice recognition technology according to an embodiment of the present disclosure.
  • Figures 15 (a) to (d) are diagrams illustrating a method of summarizing video data using, for example, a phototoon service.
  • The summary can mean providing only the main composite images among the composite images in which voice-recognition-processed text information and corresponding image data are synthesized into one image, based on one video or on a predetermined unit that is the target of the phototoon service within one video.
  • Figures 15 (a) to (d) may represent images synthesized by applying STT conversion processing to the voice data corresponding to a representative scene image of each scene unit within one video. At this time, an audio waveform is output at the bottom of each representative scene image, and location information of the current audio output can also be provided.
  • the artificial intelligence device 10 unfolds and provides composite images of the scene associated with (or mapped to) the corresponding composite image in a slide manner.
  • The artificial intelligence device 10 outputs a composite image corresponding to the voice location and, depending on the selection, the composite images existing after that location can be played or provided sequentially.
  • The artificial intelligence device 10 can play at least two or more composite images (e.g., those shown in Figures 15 (a) and (c)) simultaneously according to the user's selection. At this time, since text information is provided in the composite image itself, the voice data can be muted.
  • When at least one image (1510 to 1540) or a voice waveform is long-clicked, the artificial intelligence device 10 can guide or provide a way to change and control the playback speed or size of the composite image.
  • The artificial intelligence device 10 may divide the entire video section by a predetermined unit, for example, a fitness routine. Accordingly, the composite images may be divided into a plurality of groups 1610, 1620, and 1630 (e.g., upper body fitness, lower body fitness, etc.), and composite images may be generated for each group.
  • The artificial intelligence device 10 may provide summary data of the fitness video by providing the composite images group by group, as shown in FIG. 16A.
  • The artificial intelligence device 10 may also provide a summary service for dramas and movies according to the phototoon service requested by the user.
  • According to the user's phototoon service request, the artificial intelligence device 10 may extract composite image candidate images based on an actor, such as the main character of each series, or based on scene properties (e.g., action scene, driving scene, love scene), extract the corresponding audio data, perform STT processing, synthesize each candidate into one image (candidate image plus a speech bubble containing the converted text), and provide the results sequentially according to the playback order or in a slide manner.
  • The phototoon summary service may be provided according to a phototoon service provision request or according to a separate phototoon summary request.
  • Here, a group may be defined differently depending on the category, attribute, and the like.
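  • The grouping into routine units described above could be sketched as follows, assuming each composite has already been tagged with a group label (e.g., by a classifier or by metadata); the label names and the per-group summary rule are illustrative assumptions.

        from collections import defaultdict

        def group_composites(composites):
            # e.g., labels such as "upper body fitness" or "lower body fitness"
            groups = defaultdict(list)
            for c in composites:
                groups[c["group_label"]].append(c)
            return groups

        def summarize_by_group(groups, per_group=3):
            # One possible summary: keep only the first few composites of each group.
            return {label: items[:per_group] for label, items in groups.items()}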
  • When a predetermined object in a composite image is selected, the artificial intelligence device 10 may operate as follows.
  • The artificial intelligence device 10 may provide information related to the object or a list of other composite images related to the object.
  • Alternatively, the artificial intelligence device 10 may re-perform the synthesis processing process for the phototoon service on the target video data based on the corresponding object and provide the result.
  • For example, the artificial intelligence device 10 may perform the synthesis processing process for the phototoon service on the target video based on user A, the main character, and provide composite images.
  • In this case, the artificial intelligence device 10 may collect and output only the composite images for user B among the composite images, or may provide composite images by re-performing the synthesis processing process for the phototoon service on the target video based on user B.
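  • The person-based collection and re-synthesis described above could look like the following sketch; face identification, the person_ids field, and the synthesize_phototoon helper are assumptions for illustration.

        def composites_for_person(composites, person_id, target_video=None):
            # Collect only the composites in which the selected person appears.
            selected = [c for c in composites if person_id in c.get("person_ids", ())]
            if selected or target_video is None:
                return selected
            # Otherwise re-run the phototoon synthesis on the target video,
            # using the selected person as the anchor object (assumed helper).
            return synthesize_phototoon(target_video, anchor_person=person_id)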
  • The phototoon service according to the present disclosure may divide the target video into sections where a face is exposed and sections where no face is exposed, and perform the compositing process only for the sections where a face is exposed.
  • Alternatively, the phototoon service may perform the compositing process for each section, and construct and output a summary phototoon for each section.
  • A composite image in the phototoon service is created by combining a still image and text data. The still image and the text data may be data for the same synchronized section.
  • The composite image of the phototoon service is synthesized based on an image in which the person is exposed; however, even for an output image (scene) in which the person is not exposed but whose audio contains only that person's voice, the audio data may be STT-converted and combined with an image of the person in question to create a composite image.
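  • The split into face-exposed and non-exposed sections could be sketched as below, where sampled frames are checked for a face and consecutive positive samples are merged into (start, end) sections; the sampling rate and the sample_frames/detect_face helpers are assumptions.

        def face_sections(video, sample_fps=1.0):
            sections, current = [], None
            for t, frame in sample_frames(video, sample_fps):  # assumed helper
                if detect_face(frame) is not None:             # assumed helper
                    current = (current[0], t) if current else (t, t)
                elif current is not None:
                    sections.append(current)
                    current = None
            if current is not None:
                sections.append(current)
            return sections  # compositing is then performed only for these sections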
  • The number of composite images that make up the phototoon service may be determined in proportion to the amount or playback time of the target video. For example, assuming that a 10-minute target video yields 10 composite images, a 30-minute target video may yield 30 composite images. However, even in this case, if the playback time of the target video exceeds a certain level, the number may be limited to a predetermined maximum number of composite images.
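  • The proportional rule with an upper limit can be written as a one-line calculation; the rate of one composite per minute follows the example above, while the maximum of 50 is an assumed value.

        def composite_count(duration_minutes, per_minute=1, max_count=50):
            # 10 minutes -> 10 composites, 30 minutes -> 30, capped at max_count.
            return min(round(duration_minutes * per_minute), max_count)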
  • The phototoon service is provided by synthesizing voice-recognition-based text conversion data with video data, thereby expanding the usability of the system and improving or maximizing user satisfaction.
  • However, the present disclosure is not limited to this; conversely, for data consisting of still image data and text, the phototoon service may be provided in the same way as for video data by converting the text into speech. This principle can be easily inferred by referring to the above-described embodiments.
  • A phototoon service can be provided for a desired portion of video data of a predetermined length, and multimedia functions can be provided in conjunction with various applications.
  • At least some of the operations disclosed in this disclosure may be performed simultaneously, may be performed in an order different from the order described above, or some of them may be omitted or added.
  • The above-described method can be implemented as processor-readable code on a medium on which a program is recorded.
  • Media that the processor can read include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices.
  • The artificial intelligence device described above is not limited to the configuration and method of the above-described embodiments; rather, all or part of each embodiment may be selectively combined and configured so that various modifications can be made.
  • According to the present disclosure, a phototoon service using voice recognition technology is provided for predetermined units of data constituting video data of various lengths, and a service that allows video data summarized as a phototoon to be recognized simply and easily is provided; therefore, the present disclosure has industrial applicability because it can maximize user satisfaction.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Sont divulgués un dispositif d'intelligence artificielle, et un procédé de commande associé. Le procédé de commande d'un dispositif d'intelligence artificielle selon au moins l'un des divers modes de réalisation de la présente invention peut comprendre les étapes consistant à : détecter un événement ; extraire au moins un élément de données d'image constituant les données vidéo en fonction de l'événement ; extraire des données vocales correspondant aux données d'image et effectuer un traitement STT ; synthétiser les données traitées par STT et les données d'image en une seule image ; et délivrer en sortie l'image synthétisée.
PCT/KR2022/016193 2022-10-21 2022-10-21 Dispositif d'intelligence artificielle, et procédé de commande associé WO2024085290A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2022/016193 WO2024085290A1 (fr) 2022-10-21 2022-10-21 Dispositif d'intelligence artificielle, et procédé de commande associé

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2022/016193 WO2024085290A1 (fr) 2022-10-21 2022-10-21 Dispositif d'intelligence artificielle, et procédé de commande associé

Publications (1)

Publication Number Publication Date
WO2024085290A1 true WO2024085290A1 (fr) 2024-04-25

Family

ID=90737875

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/016193 WO2024085290A1 (fr) 2022-10-21 2022-10-21 Dispositif d'intelligence artificielle, et procédé de commande associé

Country Status (1)

Country Link
WO (1) WO2024085290A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180133188A (ko) * 2017-06-05 2018-12-13 주식회사 토리웍스 모바일 웹툰 오픈 자동번역 서비스 제공 방법
KR102018331B1 (ko) * 2016-01-08 2019-09-04 한국전자통신연구원 음성 인식 시스템에서의 발화 검증 장치 및 그 방법
KR20210039583A (ko) * 2019-10-02 2021-04-12 에스케이텔레콤 주식회사 멀티모달 기반 사용자 구별 방법 및 장치
KR20210094323A (ko) * 2020-01-21 2021-07-29 엘지전자 주식회사 감성을 포함하는 음성을 제공하는 인공 지능 장치, 인공 지능 서버 및 그 방법
KR102302029B1 (ko) * 2020-11-23 2021-09-15 (주)펜타유니버스 인공지능 기반 복합 입력 인지 시스템

Similar Documents

Publication Publication Date Title
WO2017160073A1 (fr) Procédé et dispositif pour une lecture, une transmission et un stockage accélérés de fichiers multimédia
WO2020222444A1 (fr) Serveur pour déterminer un dispositif cible sur la base d'une entrée vocale d'un utilisateur et pour commander un dispositif cible, et procédé de fonctionnement du serveur
WO2018043991A1 (fr) Procédé et appareil de reconnaissance vocale basée sur la reconnaissance de locuteur
WO2018199390A1 (fr) Dispositif électronique
WO2019182226A1 (fr) Système de traitement de données sonores et procédé de commande dudit système
WO2019039834A1 (fr) Procédé de traitement de données vocales et dispositif électronique prenant en charge ledit procédé
WO2017039142A1 (fr) Appareil terminal d'utilisateur, système et procédé de commande associé
WO2013168970A1 (fr) Procédé et système d'exploitation de service de communication
WO2014107101A1 (fr) Appareil d'affichage et son procédé de commande
WO2014107097A1 (fr) Appareil d'affichage et procédé de commande dudit appareil d'affichage
WO2020196955A1 (fr) Dispositif d'intelligence artificielle et procédé de fonctionnement d'un dispositif d'intelligence artificielle
WO2014003283A1 (fr) Dispositif d'affichage, procédé de commande de dispositif d'affichage, et système interactif
WO2020218650A1 (fr) Dispositif électronique
WO2021045447A1 (fr) Appareil et procédé de fourniture d'un service d'assistant vocal
WO2019078615A1 (fr) Procédé et dispositif électronique pour traduire un signal vocal
WO2020230926A1 (fr) Appareil de synthèse vocale pour évaluer la qualité d'une voix synthétisée en utilisant l'intelligence artificielle, et son procédé de fonctionnement
WO2020050509A1 (fr) Dispositif de synthèse vocale
WO2019151802A1 (fr) Procédé de traitement d'un signal vocal pour la reconnaissance de locuteur et appareil électronique mettant en oeuvre celui-ci
WO2020226213A1 (fr) Dispositif d'intelligence artificielle pour fournir une fonction de reconnaissance vocale et procédé pour faire fonctionner un dispositif d'intelligence artificielle
WO2020263016A1 (fr) Dispositif électronique pour le traitement d'un énoncé d'utilisateur et son procédé d'opération
WO2020153717A1 (fr) Dispositif électronique et procédé de commande d'un dispositif électronique
WO2020218635A1 (fr) Appareil de synthèse vocale utilisant une intelligence artificielle, procédé d'activation d'appareil de synthèse vocale et support d'enregistrement lisible par ordinateur
WO2023085584A1 (fr) Dispositif et procédé de synthèse vocale
WO2019039873A1 (fr) Système et dispositif électronique pour générer un modèle de synthèse texte-parole
WO2020076089A1 (fr) Dispositif électronique de traitement de parole d'utilisateur et son procédé de commande