WO2024085290A1

WO2024085290A1 - Artificial intelligence device and operation method thereof

Info

Publication number: WO2024085290A1
Application number: PCT/KR2022/016193
Authority: WO
Inventors: 김성진; 허진영; 전영혁; 김중락; 허정; 이재훈
Original assignee: 엘지전자 주식회사
Priority date: 2022-10-21
Filing date: 2022-10-21
Publication date: 2024-04-25

Abstract

Disclosed are an artificial intelligence device and an operation method thereof. The operation method of an artificial intelligence device according to at least one of various embodiments disclosed herein may comprise the steps of: detecting an event; extracting at least one piece of image data constituting the video data according to the event; extracting speech data corresponding to the image data and performing STT processing; synthesizing the STT-processed data and the image data into a single image; and outputting the synthesized image.

Description

Artificial intelligence devices and their operation methods

This disclosure relates to an artificial intelligence device that provides a phototoon service for a predetermined unit of data in video data and a method of operating the same.

With the rapid development of digital technology, demand and supply for video and multimedia have recently increased rapidly.

In addition, compared to before, the number of cases in which videos are watched or videos are provided as search results is increasing.

However, unlike text, the entire content of a video cannot be checked at a glance; it can only be understood by playing it through. Because the time required to understand the content, that is, the video playback time, is generally longer than for text, the viewer must keep consuming unnecessary information and cannot easily obtain the desired information quickly. It is also not easy to search directly for desired information within the video and jump to it.

For this reason, methods for providing a video summary are being studied, but since only the video data itself is provided in a summarized manner, there is a problem in that it is not easy to convey accurate information or understand the content from the video data itself.

The purpose of this disclosure is to provide an artificial intelligence device that provides a phototoon service based on voice recognition technology and a method of operating the same.

A method of operating an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure includes: detecting an event; extracting at least one image data constituting video data according to the event; extracting voice data corresponding to the image data and performing STT processing on it; combining the STT-processed data and the image data into one image; and outputting the synthesized image.
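The order of these steps can be sketched roughly as follows. This is only an illustrative outline under stated assumptions, not the disclosed implementation: `PhototoonPanel`, `make_phototoon`, the event name `"phototoon_request"`, and the injected `extract_frames`, `extract_audio`, and `stt` callables are hypothetical placeholders for components the disclosure does not specify at this level.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PhototoonPanel:
    """One synthesized image: a frame identifier plus the text recognized from its audio."""
    frame_id: int
    caption: str


def make_phototoon(
    event: str,
    extract_frames: Callable[[], Iterable[int]],
    extract_audio: Callable[[int], bytes],
    stt: Callable[[bytes], str],
) -> List[PhototoonPanel]:
    """Run the steps in order: event, image extraction, voice extraction, STT, compositing."""
    # Step 1: act only when the (hypothetical) service-request event is detected.
    if event != "phototoon_request":
        return []
    panels = []
    # Step 2: extract the image data selected for the event.
    for frame_id in extract_frames():
        # Step 3: extract the matching voice data and convert it to text.
        text = stt(extract_audio(frame_id))
        # Step 4: pair text and image; real compositing would draw the text onto the image.
        panels.append(PhototoonPanel(frame_id, text))
    # Step 5: the caller outputs the returned panels.
    return panels
```

Passing the extraction and STT stages in as callables keeps the sketch independent of any particular video decoder or recognition engine.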

According to a method of operating an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the event may include receiving a phototoon service request signal.

According to a method of operating an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the at least one image data may be data corresponding to any one of a frame, a scene, and a sequence unit that is a set of a plurality of scenes.

According to a method of operating an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the at least one image data may be determined based on an object in the video data.

According to a method of operating an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the method includes: detecting a face from the at least one image data; if the size of the detected face exceeds a threshold, recognizing the direction of the face; recognizing the position of the mouth of the face; determining a position of a speech bubble that will contain the STT-processed data according to the direction of the face and the position of the mouth recognized with respect to the detected face; and compositing the image data so that a speech bubble containing the STT-processed data is positioned at the determined position of the speech bubble.

According to a method of operating an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, when a face is not detected from the at least one image data, the method may further include determining a position so that the speech bubble is output in one area of the screen, and compositing it with the image data.
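The speech bubble placement described in the two preceding paragraphs might look like the sketch below. All specifics are assumptions made for illustration: the `Face` structure, the `"left"`/`"right"` direction encoding, the area threshold, the pixel offset, and the bottom-center fallback area are not given in the disclosure. Treating a face that is detected but does not exceed the threshold the same way as a missing face is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Face:
    width: int               # bounding-box width in pixels
    height: int              # bounding-box height in pixels
    direction: str           # "left" or "right": side the face is turned toward (assumed encoding)
    mouth: Tuple[int, int]   # (x, y) position of the mouth in image coordinates


def bubble_position(
    face: Optional[Face],
    image_size: Tuple[int, int],
    min_face_area: int = 40 * 40,  # assumed size threshold
    offset: int = 60,              # assumed distance between mouth and bubble anchor
) -> Tuple[int, int]:
    """Return the (x, y) anchor at which the speech bubble should be drawn."""
    img_w, img_h = image_size
    # No usable face: place the bubble in a fixed screen area (bottom center assumed).
    if face is None or face.width * face.height <= min_face_area:
        return (img_w // 2, img_h * 9 // 10)
    mouth_x, mouth_y = face.mouth
    # Shift the bubble toward the side the face is turned to, and above the mouth.
    dx = offset if face.direction == "right" else -offset
    x = min(max(mouth_x + dx, 0), img_w - 1)
    y = min(max(mouth_y - offset, 0), img_h - 1)
    return (x, y)
```

Clamping to the image bounds keeps the anchor on screen when the mouth is close to an edge.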

According to a method of operating an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the at least one image data may correspond to a scene change section in the video or a sound output section.
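One possible way to pick such image data is sketched below. The disclosure does not state how scene changes or sound output sections are detected, so the mean absolute pixel difference, the per-frame audio energy values, and both thresholds are assumptions chosen purely for illustration; `select_frames` is a hypothetical helper name.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Average per-pixel absolute difference between two equally sized grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_frames(
    frames: Sequence[Sequence[int]],
    audio_energy: Sequence[float],
    scene_threshold: float = 30.0,  # assumed scene-change sensitivity
    sound_threshold: float = 0.1,   # assumed minimum energy for "sound is being output"
) -> List[int]:
    """Return indices of frames at a scene change or inside a sound output section."""
    selected = []
    for i in range(len(frames)):
        # A scene change is assumed when a frame differs strongly from its predecessor.
        scene_change = i > 0 and mean_abs_diff(frames[i - 1], frames[i]) > scene_threshold
        has_sound = audio_energy[i] > sound_threshold
        if scene_change or has_sound:
            selected.append(i)
    return selected
```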

According to a method of operating an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, when there are a plurality of composite images for the video data, the plurality of composite images may be grouped and summarized according to a predefined criterion so that only some of the composite images are output.
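Grouping and summarizing could, for example, be done as in the sketch below. The predefined criterion is not specified in the disclosure; grouping consecutive images by a `scene_id` and keeping the image with the longest caption in each group are assumptions, and `Composite` and `summarize` are hypothetical names.

```python
from itertools import groupby
from typing import List, NamedTuple


class Composite(NamedTuple):
    scene_id: int   # assumed grouping key
    caption: str    # STT text shown in the speech bubble


def summarize(composites: List[Composite]) -> List[Composite]:
    """Group consecutive composites by scene and keep one representative per group."""
    summary = []
    for _, group in groupby(composites, key=lambda c: c.scene_id):
        # Assumed rule: the composite with the longest caption represents its group.
        summary.append(max(group, key=lambda c: len(c.caption)))
    return summary
```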

An artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure includes a display that outputs video data; and a processor that controls the display, wherein the processor detects an event, extracts at least one image data constituting the video data according to the event, extracts voice data corresponding to the image data and performs STT processing on it, and combines the STT-processed data and the image data into one image to output a composite image.

According to an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the event includes receiving a phototoon service request signal, and the at least one image data may be data corresponding to any one of a frame, a scene, and a sequence unit that is a set of a plurality of scenes.

According to an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the processor may determine the at least one image data based on an object in the video data.

According to an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the processor may detect a face from the at least one image data; when the size of the detected face exceeds a threshold, recognize the direction of the face and the position of the mouth; determine the position of the speech bubble that will contain the STT-processed data according to the direction of the face and the mouth position recognized for the detected face; and combine it with the image data so that a speech bubble containing the STT-processed data is positioned at the determined position.

According to an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, when a face is not detected from the at least one image data, the processor may determine a position so that the speech bubble is output in one area of the screen and combine it with the image data.

According to an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, the at least one image data may correspond to a scene change section in the video or a sound output section.

According to an artificial intelligence device according to at least one embodiment among various embodiments of the present disclosure, when there are a plurality of composite images for the video data, the processor may group and summarize the plurality of composite images according to predefined criteria so that only some of the composite images are output.

Further scope of applicability of the present invention will become apparent from the detailed description that follows. However, since various changes and modifications within the spirit and scope of the present invention may be clearly understood by those skilled in the art, the detailed description and specific embodiments such as preferred embodiments of the present invention should be understood as being given only as examples.

According to at least one embodiment among various embodiments of the present disclosure, there is an effect of increasing the utilization of artificial intelligence devices and increasing user satisfaction by providing a phototoon service for a desired portion (all or part) of video data.

According to at least one embodiment among various embodiments of the present disclosure, there is an effect of providing multimedia functions in conjunction with various applications.

Figure 1 is a diagram for explaining a voice service system according to an embodiment of the present invention.

Figure 2 is a block diagram for explaining the configuration of an artificial intelligence device according to an embodiment of the present disclosure.

Figure 3 is a block diagram for explaining the configuration of a voice service server according to an embodiment of the present invention.

Figure 4 is a diagram illustrating an example of converting a voice signal into a power spectrum according to an embodiment of the present invention.

Figure 5 is a block diagram illustrating the configuration of a processor for voice recognition and synthesis of an artificial intelligence device, according to an embodiment of the present invention.

Figure 6 is a block diagram of a voice service system for providing a voice recognition-based phototoon service according to an embodiment of the present disclosure.

Figure 7 is a block diagram of the processor of Figure 6.

Figures 8 to 11 are flowcharts illustrating a method of providing a phototoon service according to the present disclosure.

Figures 12 to 14 are diagrams to explain a method of providing a phototoon service according to an embodiment of the present disclosure.

FIG. 15 is a diagram illustrating a method of providing a phototoon service using voice recognition technology according to an embodiment of the present disclosure.

Figures 16a and 16b are diagrams to explain a method of providing a phototoon service using voice recognition technology according to an embodiment of the present disclosure.

Hereinafter, embodiments disclosed in the present specification will be described in detail with reference to the attached drawings. Identical or similar components are assigned the same reference numerals regardless of the figure in which they appear, and duplicate descriptions thereof will be omitted. The suffixes “module” and “part” for components used in the following description are given or used interchangeably only for the ease of preparing the specification, and do not have distinct meanings or roles in themselves. Additionally, in describing the embodiments disclosed in this specification, if it is determined that detailed descriptions of related known technologies may obscure the gist of the embodiments disclosed in this specification, the detailed descriptions will be omitted. In addition, the attached drawings are only for easy understanding of the embodiments disclosed in this specification; the technical idea disclosed in this specification is not limited by the attached drawings and should be understood to include all changes, equivalents, and substitutes included in the spirit and technical scope of the present invention.

Terms containing ordinal numbers, such as first, second, etc., may be used to describe various components, but the components are not limited by the terms. The above terms are used only for the purpose of distinguishing one component from another.

When a component is said to be "connected" or "coupled" to another component, it should be understood that it may be directly connected or coupled to the other component, but that other components may exist in between. On the other hand, when a component is said to be “directly connected” or “directly coupled” to another component, it should be understood that there are no other components in between.

'Artificial intelligence devices' described in this specification may include mobile phones, smartphones, laptop computers, artificial intelligence devices for digital broadcasting, personal digital assistants (PDAs), portable multimedia players (PMPs), navigation devices, slate PCs, tablet PCs, ultrabooks, and wearable devices (e.g., a watch-type artificial intelligence device (smartwatch), a glass-type artificial intelligence device (smart glasses), or a head mounted display (HMD)).

However, artificial intelligence devices according to embodiments described in this specification may also be applied to fixed artificial intelligence devices such as smart TVs, desktop computers, digital signage, refrigerators, washing machines, air conditioners, and dishwashers.

Additionally, the artificial intelligence device 10 according to an embodiment of the present invention can also be applied to a fixed or movable robot.

Additionally, the artificial intelligence device 10 according to an embodiment of the present invention can perform the function of a voice agent. A voice agent may be a program that recognizes the user's voice and outputs a response appropriate for the recognized user's voice as a voice.

Figure 1 is a diagram for explaining a voice service system according to an embodiment of the present invention.

The voice service may include at least one of voice recognition and voice synthesis services. The voice recognition and synthesis process may include converting the speaker's (or user's) voice data into text data, analyzing the speaker's intention based on the converted text data, converting the text data corresponding to the analyzed intention into synthesized voice data, and outputting the converted synthesized voice data.

For the voice recognition and synthesis process, a voice service system, as shown in Figure 1, can be used.

Referring to Figure 1, the voice service system may include an artificial intelligence device 10, a speech-to-text (STT) server 20, a natural language processing (NLP) server 30, and a voice synthesis server 40. A plurality of AI agent servers 50-1 to 50-3 communicate with the NLP server 30 and may be included in the voice service system.

Meanwhile, the STT server 20, NLP server 30, and voice synthesis server 40 may exist as separate servers as shown, or may be included in one server. In addition, a plurality of AI agent servers 50-1 to 50-3 may also exist as separate servers or may be included in the NLP server 30.

The artificial intelligence device 10 may transmit a voice signal corresponding to the speaker's voice received through the microphone 122 to the STT server 20.

The STT server 20 can convert voice data received from the artificial intelligence device 10 into text data.

The STT server 20 can increase the accuracy of voice-to-text conversion by using a language model.

A language model can refer to a model that can calculate the probability of a sentence or the probability of the next word appearing given the previous words.

For example, the language model may include probabilistic language models such as Unigram model, Bigram model, N-gram model, etc.

The unigram model is a model that assumes that the usage of all words is completely independent of each other, and calculates the probability of a word string as the product of the probability of each word.

The bigram model is a model that assumes that the use of a word depends only on the previous word.

The N-gram model is a model that assumes that the usage of a word depends on the previous (n-1) words.

In other words, the STT server 20 can use the language model to determine whether text data converted from voice data has been appropriately converted, and through this, the accuracy of conversion to text data can be increased.
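As an illustration of how a language model can help judge a conversion, the sketch below trains a small bigram model and uses it to choose between candidate transcriptions. It is a minimal example under stated assumptions, not the STT server's actual method: the class `BigramModel`, the helper `pick_transcript`, the add-one smoothing, and the sentence boundary markers `<s>` and `</s>` are all choices made here for illustration.

```python
import math
from collections import Counter
from typing import List


class BigramModel:
    """Bigram language model: each word is assumed to depend only on the previous word."""

    def __init__(self, corpus: List[str]) -> None:
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for sentence in corpus:
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.vocab_size = len(self.unigrams)

    def log_prob(self, sentence: str) -> float:
        """Log-probability of a sentence as the sum of smoothed bigram log-probabilities."""
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        total = 0.0
        for prev, word in zip(words, words[1:]):
            # Add-one smoothing keeps unseen word pairs from getting zero probability.
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.unigrams[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def pick_transcript(model: BigramModel, candidates: List[str]) -> str:
    """Choose the candidate transcription that the language model finds most probable."""
    return max(candidates, key=model.log_prob)
```

With a training corpus of commands such as "turn on the tv", the model would prefer the candidate "turn on the tv" over the acoustically similar "torn on the tv", because the word pair ("turn", "on") has been seen before while ("torn", "on") has not.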

The NLP server 30 may receive text data from the STT server 20. The STT server 20 may be included in the NLP server 30.

The NLP server 30 may perform intent analysis on text data based on the received text data.

The NLP server 30 may transmit intention analysis information indicating the result of intention analysis to the artificial intelligence device 10.

The NLP server 30 may transmit intention analysis information to the voice synthesis server 40. The voice synthesis server 40 may generate a synthesized voice based on intent analysis information and transmit the generated synthesized voice to the artificial intelligence device 10.

The NLP server 30 may generate intention analysis information by sequentially performing a morpheme analysis step, a syntax analysis step, a dialogue act analysis step, and a dialogue processing step on text data.

The morpheme analysis step is a step that classifies text data corresponding to the voice uttered by the user into morpheme units, which are the smallest units with meaning, and determines what part of speech each classified morpheme has.

The syntax analysis step is a step that uses the results of the morpheme analysis step to classify text data into noun phrases, verb phrases, adjective phrases, etc., and determines what kind of relationship exists between each classified phrase.

Through the syntax analysis step, the subject, object, and modifiers of the voice uttered by the user can be determined.

The dialogue act analysis step is a step of analyzing the intention of the voice uttered by the user using the results of the syntax analysis step. Specifically, the dialogue act analysis step determines the intent of the sentence, such as whether the user is asking a question, making a request, or simply expressing an emotion.

The dialogue processing step is a step that uses the results of the dialogue act analysis step to determine whether to answer the user's utterance, respond to it, or ask a question for additional information.

After the dialogue processing step, the NLP server 30 may generate intention analysis information including one or more of an answer to the intention uttered by the user, a response, and an inquiry for additional information.
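The four steps can be pictured as a chain of functions, each consuming the result of the previous one. The toy sketch below only shows that data flow: the part-of-speech lexicon, the rules, the labels such as `"question"` or `"ask_for_more_info"`, and all function names are hypothetical and far simpler than any real NLP server.

```python
from typing import Dict, List, Tuple

# Hypothetical miniature lexicon; a real analyzer would use a trained morphological model.
POS_LEXICON = {
    "what": "PRON", "is": "VERB", "the": "DET", "weather": "NOUN",
    "play": "VERB", "music": "NOUN", "please": "ADV",
}
QUESTION_WORDS = {"what", "who", "where", "when", "how"}


def morpheme_analysis(text: str) -> List[Tuple[str, str]]:
    """Split text into units and tag each with a part of speech (toy: whitespace tokens)."""
    tokens = text.lower().strip("?!. ").split()
    return [(token, POS_LEXICON.get(token, "UNK")) for token in tokens]


def syntax_analysis(tagged: List[Tuple[str, str]]) -> Dict[str, object]:
    """Collect verbs and nouns as a stand-in for phrase structure."""
    return {
        "first": tagged[0][0] if tagged else "",
        "verbs": [word for word, pos in tagged if pos == "VERB"],
        "nouns": [word for word, pos in tagged if pos == "NOUN"],
    }


def dialogue_act_analysis(parse: Dict[str, object]) -> str:
    """Decide whether the utterance is a question, a request, or a plain statement."""
    if parse["first"] in QUESTION_WORDS:
        return "question"
    if parse["verbs"] and parse["first"] == parse["verbs"][0]:
        return "request"  # imperative: the sentence starts with a verb
    return "statement"


def dialogue_processing(act: str, parse: Dict[str, object]) -> str:
    """Choose between answering, responding, and asking for additional information."""
    if act == "question":
        return "answer"
    if act == "request":
        return "respond" if parse["nouns"] else "ask_for_more_info"
    return "respond"


def analyze(text: str) -> Dict[str, str]:
    """Run the four steps in sequence and return toy intention analysis information."""
    parse = syntax_analysis(morpheme_analysis(text))
    act = dialogue_act_analysis(parse)
    return {"act": act, "action": dialogue_processing(act, parse)}
```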

The NLP server 30 may transmit a search request to a search server (not shown) and receive search information corresponding to the search request in order to search for information that matches the user's utterance intention.

When the user's utterance intention is to search for content, the search information may include information about the searched content.

The NLP server 30 transmits search information to the artificial intelligence device 10, and the artificial intelligence device 10 can output the search information.

Meanwhile, the NLP server 30 may receive text data from the artificial intelligence device 10. For example, if the artificial intelligence device 10 supports a voice-to-text conversion function, the artificial intelligence device 10 may convert voice data into text data and transmit the converted text data to the NLP server 30.

The voice synthesis server 40 can generate a synthesized voice by combining pre-stored voice data.

The voice synthesis server 40 can record the voice of a person selected as a model and divide the recorded voice into syllables or words.

The voice synthesis server 40 can store the segmented voice in units of syllables or words in an internal or external database.

The voice synthesis server 40 may search for syllables or words corresponding to given text data from a database, synthesize a combination of the searched syllables or words, and generate a synthesized voice.
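The database lookup and combination described above can be sketched as simple concatenation of stored units. This is a simplified illustration under assumptions: the `UnitDatabase` type, word-level units keyed by lowercase text, and waveforms represented as lists of sample values are all hypothetical, and a real synthesizer would also smooth the joins between units.

```python
from typing import Dict, List

# Hypothetical database: each word maps to the recorded samples for that word.
UnitDatabase = Dict[str, List[float]]


def synthesize(text: str, database: UnitDatabase) -> List[float]:
    """Look up each word of the text and concatenate the stored waveforms in order.

    Raises KeyError if a word has no recorded unit in the database.
    """
    waveform: List[float] = []
    for word in text.lower().split():
        waveform.extend(database[word])
    return waveform
```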

The voice synthesis server 40 may store a plurality of voice language groups corresponding to each of a plurality of languages.

For example, the voice synthesis server 40 may include a first voice language group recorded in Korean and a second voice language group recorded in English.

The voice synthesis server 40 may translate text data in the first language into text in the second language and generate a synthesized voice corresponding to the translated text in the second language using the second voice language group.

The voice synthesis server 40 can transmit the generated synthesized voice to the artificial intelligence device 10.

The voice synthesis server 40 may receive analysis information from the NLP server 30. The analysis information may include information analyzing the intention of the voice uttered by the user.

The voice synthesis server 40 may generate a synthesized voice that reflects the user's intention based on the analysis information.

The functions of the STT server 20, NLP server 30, and voice synthesis server 40 described above may also be performed by the artificial intelligence device 10. For this purpose, the artificial intelligence device 10 may include one or more processors.

Each of the plurality of AI agent servers 50-1 to 50-3 may transmit search information to the NLP server 30 or the artificial intelligence device 10 according to a request from the NLP server 30.

If the intention analysis result of the NLP server 30 is a content search request, the NLP server 30 may transmit the content search request to one or more of the plurality of AI agent servers 50-1 to 50-3 and receive content search results from the corresponding server.

The NLP server 30 may transmit the received search results to the artificial intelligence device 10.

Figure 2 is a block diagram for explaining the configuration of an artificial intelligence device 10 according to an embodiment of the present disclosure.

Referring to FIG. 2, the artificial intelligence device 10 may include a communication unit 110, an input unit 120, a learning processor 130, a sensing unit 140, an output unit 150, a memory 170, and a processor 180.

The communication unit 110 can transmit and receive data with external devices using wired and wireless communication technology. For example, the communication unit 110 may transmit and receive sensor information, user input, learning models, and control signals with external devices.

At this time, communication technologies used by the communication unit 110 include GSM (Global System for Mobile communication), CDMA (Code Division Multiple Access), LTE (Long Term Evolution), LTE-A (LTE-Advanced), 5G, WLAN (Wireless LAN), Wi-Fi (Wireless-Fidelity), Bluetooth™, RFID (Radio Frequency Identification), IrDA (Infrared Data Association), ZigBee, and NFC (Near Field Communication).

The input unit 120 can acquire various types of data.

The input unit 120 may include a camera for inputting video signals, a microphone for receiving audio signals, and a user input unit for receiving information from a user. Here, the camera or microphone may be treated as a sensor, and the signal obtained from the camera or microphone may be referred to as sensing data or sensor information.

The input unit 120 may acquire training data for model learning and input data to be used when obtaining an output using the learning model. The input unit 120 may acquire unprocessed input data, and in this case, the processor 180 or the learning processor 130 may extract input features by preprocessing the input data.

The input unit 120 may include a camera 121 for inputting video signals, a microphone 122 for receiving audio signals, and a user input unit 123 for receiving information from the user.

Voice data or image data collected by the input unit 120 may be analyzed and processed as a user's control command.

The input unit 120 is for inputting image information (or signals), audio information (or signals), data, or information input from the user. To input image information, the artificial intelligence device 10 may be provided with one or more cameras 121.

The camera 121 processes image frames such as still images or moving images obtained by an image sensor in video call mode or shooting mode. The processed image frame may be displayed on the display unit 151 or stored in the memory 170.

The microphone 122 processes external acoustic signals into electrical voice data. Processed voice data can be used in various ways depending on the function (or application program being executed) being performed by the artificial intelligence device 10. Meanwhile, various noise removal algorithms may be applied to the microphone 122 to remove noise generated in the process of receiving an external acoustic signal.

The user input unit 123 is for receiving information from the user. When information is input through the user input unit 123, the processor 180 can control the operation of the artificial intelligence device 10 to correspond to the input information.

The user input unit 123 may include a mechanical input means (or a mechanical key, such as a button, dome switch, jog wheel, or jog switch located on the front, rear, or side of the terminal 100) and a touch input means. As an example, the touch input means may consist of a virtual key, soft key, or visual key displayed on the touch screen through software processing, or of a touch key placed on a part other than the touch screen.

The learning processor 130 can train a model composed of an artificial neural network using training data. Here, the learned artificial neural network may be referred to as a learning model. A learning model can be used to infer a result value for new input data other than learning data, and the inferred value can be used as the basis for a decision to perform an operation.

The learning processor 130 may include memory integrated or implemented in the artificial intelligence device 10. Alternatively, the learning processor 130 may be implemented using the memory 170, an external memory directly coupled to the artificial intelligence device 10, or a memory maintained in an external device.

The sensing unit 140 may use various sensors to obtain at least one of internal information of the artificial intelligence device 10, information about the surrounding environment of the artificial intelligence device 10, and user information.

At this time, the sensors included in the sensing unit 140 include a proximity sensor, illuminance sensor, acceleration sensor, magnetic sensor, gyro sensor, inertial sensor, RGB sensor, IR sensor, fingerprint recognition sensor, ultrasonic sensor, light sensor, microphone, lidar, radar, etc.

The output unit 150 may generate output related to vision, hearing, or tactile sensation.

The output unit 150 may include at least one of a display unit 151, a sound output unit 152, a haptic module 153, and an optical output unit 154.

The display unit 151 displays (outputs) information processed by the artificial intelligence device 10. For example, the display unit 151 may display execution screen information of an application running on the artificial intelligence device 10, or UI (User Interface) and GUI (Graphic User Interface) information according to such execution screen information.

The display unit 151 can implement a touch screen by forming a layered structure or being integrated with the touch sensor. This touch screen functions as a user input unit 123 that provides an input interface between the artificial intelligence device 10 and the user, and can simultaneously provide an output interface between the terminal 100 and the user.

The sound output unit 152 may output audio data received from the communication unit 110 or stored in the memory 170 in call signal reception, call mode or recording mode, voice recognition mode, broadcast reception mode, etc.

The sound output unit 152 may include at least one of a receiver, a speaker, and a buzzer.

The haptic module 153 generates various tactile effects that the user can feel. A representative example of a tactile effect generated by the haptic module 153 may be vibration.

The optical output unit 154 uses light from the light source of the artificial intelligence device 10 to output a signal to notify the occurrence of an event. Examples of events that occur in the artificial intelligence device 10 may include receiving a message, receiving a call signal, missed call, alarm, schedule notification, receiving email, receiving information through an application, etc.

The memory 170 can store data supporting various functions of the artificial intelligence device 10. For example, the memory 170 may store input data, learning data, learning models, learning history, etc. obtained from the input unit 120.

The processor 180 may determine at least one executable operation of the artificial intelligence device 10 based on information determined or generated using a data analysis algorithm or a machine learning algorithm. And the processor 180 can control the components of the artificial intelligence device 10 to perform the determined operation.

The processor 180 may request, retrieve, receive, or utilize data from the learning processor 130 or the memory 170, and may control the components of the artificial intelligence device 10 to execute an operation that is predicted or determined to be desirable among the at least one executable operation.

If linkage with an external device is necessary to perform a determined operation, the processor 180 may generate a control signal to control the external device and transmit the generated control signal to the external device.

The processor 180 may obtain intent information for user input and determine the user's request based on the obtained intent information.

The processor 180 may obtain intent information corresponding to the user input by using at least one of an STT engine for converting voice input into a character string or an NLP engine for obtaining intent information of natural language.

At least one of the STT engine and the NLP engine may be composed, at least in part, of an artificial neural network trained according to a machine learning algorithm. In addition, at least one of the STT engine and the NLP engine may have been trained by the learning processor 130, by the learning processor 240 of the AI server 200, or by distributed processing thereof.

The processor 180 may collect history information including the user's feedback on the operation of the artificial intelligence device 10 and store it in the memory 170 or the learning processor 130, or transmit it to an external device such as the AI server 200. The collected history information can be used to update the learning model.

The processor 180 may control at least some of the components of the artificial intelligence device 10 to run an application program stored in the memory 170. Furthermore, the processor 180 may operate two or more of the components included in the artificial intelligence device 10 in combination with each other in order to run the application program.

Figure 3 is a block diagram for explaining the configuration of the voice service server 200 according to an embodiment of the present invention.

The voice service server 200 may include one or more of the STT server 20, NLP server 30, and voice synthesis server 40 shown in FIG. 1. The voice service server 200 may be referred to as a server system.

Referring to FIG. 3, the voice service server 200 may include a preprocessor 220, a controller 230, a communication unit 270, and a database 290.

The preprocessing unit 220 may preprocess the voice received through the communication unit 270 or the voice stored in the database 290.

The preprocessing unit 220 may be implemented as a separate chip from the controller 230 or may be implemented as a chip included in the controller 230.

The preprocessor 220 may receive a voice signal (uttered by a user) and filter noise signals from the voice signal before converting the received voice signal into text data.

If the preprocessor 220 is provided in the artificial intelligence device 10, it can recognize a startup word for activating voice recognition of the artificial intelligence device 10. The preprocessor 220 converts the startup word received through the microphone 121 into text data, and if the converted text data corresponds to a pre-stored startup word, it may determine that the startup word has been recognized.
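The startup-word check described above can be sketched as a simple text comparison. This is a minimal illustration, assuming the received speech has already been converted to text; the function names, the normalization step, and the example phrases in the usage note are hypothetical and do not come from this disclosure.

```python
def normalize(text):
    # Lowercase and collapse whitespace so that minor transcription
    # differences do not prevent a match (an assumed normalization).
    return " ".join(text.lower().split())


def is_startup_word(converted_text, stored_startup_words):
    # The startup word counts as recognized when the converted text equals
    # one of the pre-stored startup words after normalization.
    stored = {normalize(word) for word in stored_startup_words}
    return normalize(converted_text) in stored
```

For example, `is_startup_word("  Hello   Device ", ["hello device"])` evaluates to `True` under this sketch.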

The preprocessor 220 may convert the noise-removed voice signal into a power spectrum.

The power spectrum may be a parameter that indicates which frequency components and at what magnitude are included in the temporally varying waveform of a voice signal.

The power spectrum shows the distribution of squared amplitude values according to the frequency of the waveform of the voice signal.

This will be explained with reference to FIG. 4 .

Referring to Figure 4, a voice signal 410 is shown. The voice signal 410 may be received from an external device or may be a signal previously stored in the memory 170.

The x-axis of the voice signal 410 may represent time, and the y-axis may represent amplitude.

The power spectrum processor 225 may convert the voice signal 410, where the x-axis is the time axis, into a power spectrum 430, where the x-axis is the frequency axis.

The power spectrum processor 225 may convert the voice signal 410 into a power spectrum 430 using Fast Fourier Transform (FFT).

The x-axis of the power spectrum 430 represents frequency, and the y-axis represents the square value of amplitude.
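The conversion from a time-axis signal to a frequency-axis power spectrum can be illustrated with a small radix-2 FFT. This is a sketch under stated assumptions, not the implementation of the power spectrum processor 225: the function names are hypothetical, the input length is assumed to be a power of two, and no windowing or normalization is applied.

```python
import cmath
import math


def fft(samples):
    # Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two.
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(samples, sample_rate):
    # Returns (frequency in Hz, squared amplitude) pairs for the
    # non-negative frequencies, i.e. x-axis = frequency, y-axis = |X|^2.
    n = len(samples)
    spectrum = fft(samples)
    return [(k * sample_rate / n, abs(spectrum[k]) ** 2)
            for k in range(n // 2 + 1)]


# Usage: a pure 8 Hz tone sampled at 64 Hz has its peak in the 8 Hz bin.
tone = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
peak_frequency, peak_power = max(power_spectrum(tone, 64), key=lambda fp: fp[1])
```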

Figure 3 will be described again.

The functions of the preprocessor 220 and the controller 230 described in FIG. 3 can also be performed by the NLP server 30.

The pre-processing unit 220 may include a wave processing unit 221, a frequency processing unit 223, a power spectrum processing unit 225, and an STT converting unit 227.

The wave processing unit 221 can extract the waveform of the voice.

The frequency processing unit 223 can extract the frequency band of the voice.

The power spectrum processing unit 225 can extract the power spectrum of the voice.

When a waveform that fluctuates in time is given, the power spectrum may be a parameter that indicates which frequency components and at what size are included in the waveform.

The STT converter 227 can convert voice into text.

The STT conversion unit 227 can convert voice in a specific language into text in that language.

The controller 230 can control the overall operation of the voice service server 200.

The controller 230 may include a voice analysis unit 231, a text analysis unit 232, a feature clustering unit 233, a text mapping unit 234, and a voice synthesis unit 235.

The voice analysis unit 231 may extract voice characteristic information using one or more of the voice waveform, voice frequency band, and voice power spectrum preprocessed in the preprocessor 220.

The voice characteristic information may include one or more of the speaker's gender information, the speaker's voice (or tone), the pitch of the sound, the speaker's speaking style, the speaker's speech speed, and the speaker's emotion.

Additionally, the voice characteristic information may further include the speaker's timbre.

The text analysis unit 232 may extract key expressions from the text converted by the speech-to-text conversion unit 227.

When the text analysis unit 232 detects a change in tone between phrases from the converted text, it can extract the phrase with a different tone as the main expression phrase.

The text analysis unit 232 may determine that the tone has changed when the frequency band between the phrases changes more than a preset band.

The text analysis unit 232 may extract key words from phrases in the converted text. A key word may be a noun that exists within a phrase, but this is only an example.

The feature clustering unit 233 can classify the speaker's speech type using the voice characteristic information extracted from the voice analysis unit 231.

The feature clustering unit 233 may classify the speaker's utterance type by assigning a weight to each type item constituting the voice characteristic information.

The feature clustering unit 233 can classify the speaker's utterance type using the attention technique of a deep learning model.

The text mapping unit 234 may translate the text converted into the first language into the text of the second language.

The text mapping unit 234 may map the text translated into the second language with the text of the first language.

The text mapping unit 234 can map key expressions constituting the text in the first language to corresponding phrases in the second language.

The text mapping unit 234 may map the utterance type corresponding to the main expression phrases constituting the text of the first language to phrases of the second language. This is to apply the classified utterance type to the phrases of the second language.

The voice synthesis unit 235 may generate a synthesized voice by applying the utterance type and the speaker's tone classified by the feature clustering unit 233 to the main expressions of the text translated into the second language by the text mapping unit 234.

The controller 230 may determine the user's speech characteristics using one or more of the delivered text data or the power spectrum 430.

The user's speech characteristics may include the user's gender, the user's pitch, the user's tone, the user's speech topic, the user's speech speed, and the user's voice volume.

The controller 230 may use the power spectrum 430 to obtain the frequency of the voice signal 410 and the amplitude corresponding to the frequency.

The controller 230 can determine the gender of the user who uttered the voice using the frequency band of the power spectrum 430.

For example, if the frequency band of the power spectrum 430 is within the preset first frequency band range, the controller 230 may determine the user's gender as male.

If the frequency band of the power spectrum 430 is within the preset second frequency band range, the controller 230 may determine the user's gender as female. Here, the second frequency band range may be larger than the first frequency band range.

The controller 230 can determine the pitch of the voice using the frequency band of the power spectrum 430.

For example, the controller 230 may determine the pitch of the sound according to the size of the amplitude within a specific frequency band.

The controller 230 may determine the user's tone using the frequency band of the power spectrum 430. For example, the controller 230 may determine a frequency band with an amplitude greater than a certain level among the frequency bands of the power spectrum 430 as the user's main sound range, and determine the determined main sound range as the user's tone.
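The selection of a main sound range can be sketched as a threshold over the power spectrum. This is a simplified illustration: the function name and the `(frequency, squared amplitude)` input format are assumptions, and the result is reported as the lowest and highest frequencies whose amplitude exceeds the threshold, which treats the selected band as contiguous.

```python
import math


def main_sound_range(power_spectrum, amplitude_threshold):
    # power_spectrum: list of (frequency_hz, squared_amplitude) pairs.
    # Keep the frequencies whose amplitude exceeds the given level.
    selected = [freq for freq, power in power_spectrum
                if math.sqrt(power) > amplitude_threshold]
    if not selected:
        return None  # no band is loud enough
    return (min(selected), max(selected))
```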

The controller 230 may determine the user's speech rate based on the number of syllables uttered per unit time from the converted text data.
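A syllables-per-second calculation can be sketched as follows. The sketch assumes Korean text, where each precomposed Hangul character in the Unicode range U+AC00 to U+D7A3 is one syllable; other languages would need a different syllable counter. The function names and the use of seconds as the unit time are assumptions.

```python
def count_hangul_syllables(text):
    # Each precomposed Hangul syllable block (U+AC00..U+D7A3) is one syllable.
    return sum(1 for ch in text if "\uac00" <= ch <= "\ud7a3")


def speech_rate(text, duration_seconds):
    # Syllables uttered per second over the given utterance duration.
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return count_hangul_syllables(text) / duration_seconds
```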

The controller 230 can determine the topic of the user's speech using the Bag-Of-Word Model technique for the converted text data.

The Bag-Of-Word Model technique is a technique to extract frequently used words based on the frequency of words in a sentence. Specifically, the Bag-Of-Word Model technique is a technique that extracts unique words within a sentence and expresses the frequency of each extracted word as a vector to determine the characteristics of the topic of speech.

For example, if words such as <running>, <physical fitness>, etc. frequently appear in the text data, the controller 230 may classify the topic of the user's speech as exercise.
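The Bag-Of-Word approach described above can be sketched with a word-frequency vector and a keyword lookup. This is a minimal illustration for English text: the topic lexicon `TOPIC_KEYWORDS`, the function names, and the scoring rule (sum of keyword counts per topic) are hypothetical and not specified in this disclosure.

```python
import re
from collections import Counter

# Hypothetical topic lexicon used only for this illustration.
TOPIC_KEYWORDS = {
    "exercise": {"running", "fitness", "workout"},
    "cooking": {"recipe", "oven", "boil"},
}


def bag_of_words(text):
    # Extract the unique words and express each word's frequency as a vector.
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    vocabulary = sorted(counts)
    return vocabulary, [counts[word] for word in vocabulary]


def classify_topic(text):
    vocabulary, vector = bag_of_words(text)
    counts = dict(zip(vocabulary, vector))
    # Score each topic by how often its keywords occur in the text.
    scores = {topic: sum(counts.get(word, 0) for word in keywords)
              for topic, keywords in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```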

The controller 230 can determine the topic of the user's speech from text data using a known text categorization technique. The controller 230 can extract keywords from text data and determine the topic of the user's speech.

The controller 230 can determine the user's voice volume by considering amplitude information in the entire frequency band.

For example, the controller 230 may determine the user's voice volume based on the average or weighted average of the amplitude in each frequency band of the power spectrum.
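The average and weighted-average amplitude can be sketched as follows. The function name, the `(frequency, squared amplitude)` input format, and the way weights are supplied (one per frequency band) are assumptions made for illustration.

```python
import math


def voice_volume(power_spectrum, weights=None):
    # power_spectrum: list of (frequency_hz, squared_amplitude) pairs.
    amplitudes = [math.sqrt(power) for _, power in power_spectrum]
    if not amplitudes:
        raise ValueError("power spectrum is empty")
    if weights is None:
        # Plain average of the amplitude over all frequency bands.
        return sum(amplitudes) / len(amplitudes)
    if len(weights) != len(amplitudes):
        raise ValueError("one weight per frequency band is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Weighted average: bands with larger weights contribute more.
    return sum(w * a for w, a in zip(weights, amplitudes)) / total
```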

The communication unit 270 may communicate with an external server by wire or wirelessly.

The database 290 may store the voice of the first language included in the content.

The database 290 may store a synthesized voice in which the voice of the first language is converted into the voice of the second language.

The database 290 may store a first text corresponding to a voice in the first language and a second text in which the first text is translated into the second language.

The database 290 may store various learning models required for voice recognition.

Meanwhile, the processor 180 of the artificial intelligence device 10 shown in FIG. 2 may include the preprocessor 220 and the controller 230 shown in FIG. 3.

That is, the processor 180 of the artificial intelligence device 10 may perform the functions of the preprocessor 220 and the controller 230.

Figure 5 is a block diagram illustrating the configuration of a processor for voice recognition and synthesis of the artificial intelligence device 10, according to an embodiment of the present invention.

That is, the voice recognition and synthesis process of FIG. 5 may be performed by the learning processor 130 or processor 180 of the artificial intelligence device 10 without going through the server.

Referring to FIG. 5, the processor 180 of the artificial intelligence device 10 may include an STT engine 510, an NLP engine 530, and a voice synthesis engine 550.

Each engine can be either hardware or software.

The STT engine 510 may perform the function of the STT server 20 of FIG. 1. That is, the STT engine 510 can convert voice data into text data.

The NLP engine 530 may perform the functions of the NLP server 30 of FIG. 1. That is, the NLP engine 530 can obtain intention analysis information indicating the speaker's intention from the converted text data.

The voice synthesis engine 550 may perform the function of the voice synthesis server 40 of FIG. 1.

The speech synthesis engine 550 may search a database for syllables or words corresponding to given text data, synthesize a combination of the searched syllables or words, and generate a synthesized voice.

The voice synthesis engine 550 may include a preprocessing engine 551 and a TTS engine 553.

The preprocessing engine 551 may preprocess text data before generating synthetic speech.

Specifically, the preprocessing engine 551 performs tokenization by dividing text data into tokens, which are meaningful units.

After performing tokenization, the preprocessing engine 551 may perform a cleansing operation to remove unnecessary characters and symbols to remove noise.

Afterwards, the preprocessing engine 551 can generate the same word token by integrating word tokens with different expression methods.

Afterwards, the preprocessing engine 551 may remove meaningless word tokens (stopwords).
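The four preprocessing steps above (tokenization, cleansing, integration of differently expressed tokens, and stopword removal) can be sketched in order. The stopword set, the variant table, the whitespace tokenizer, and the function name are hypothetical choices for English text and are not taken from this disclosure.

```python
import re

# Hypothetical resources used only for this illustration.
STOPWORDS = {"the", "a", "an", "is", "of"}
VARIANTS = {"colour": "color"}


def preprocess(text):
    # 1. Tokenization: split the text into whitespace-delimited tokens.
    tokens = text.split()
    # 2. Cleansing: strip characters and symbols that act as noise.
    cleaned = [re.sub(r"[^A-Za-z0-9]", "", token) for token in tokens]
    cleaned = [token for token in cleaned if token]
    # 3. Integration: map differently expressed tokens to one form.
    unified = [VARIANTS.get(token.lower(), token.lower()) for token in cleaned]
    # 4. Stopword removal: drop tokens that carry little meaning.
    return [token for token in unified if token not in STOPWORDS]
```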

The TTS engine 553 can synthesize speech corresponding to preprocessed text data and generate synthesized speech.

Hereinafter, various embodiments of providing phototoon services using voice service technology (e.g., voice recognition, voice synthesis, etc.) for video data of various lengths based on various platforms consumed by artificial intelligence devices will be described.

“Phototoon” as described in this disclosure is a compound word of photo and toon. It refers to acquiring an image (in still image or video format) for a desired portion (e.g., all or part) of video data provided through the artificial intelligence device 10, converting the corresponding voice data to text, and then displaying a composite image (in still image or video format) in which the acquired image is combined with the converted text. The process of creating and providing a phototoon for video data in the artificial intelligence device 10 is referred to as a ‘phototoon service’. However, the present disclosure is not limited to the above terms.

According to an embodiment of the present disclosure, the artificial intelligence device 10 can provide a summary service (summary or summary data) for desired portions of target video data through the phototoon service.

According to one embodiment, the phototoon service may be provided in such a way that the target video data is output as is, but the phototoon composite image is output only in a specific section, that is, the phototoon service section.

According to another embodiment, the phototoon service may be provided in such a way that a phototoon composite image for a specific section is generated separately from the playback of the target video, and only the phototoon composite image is output.

Meanwhile, a plurality of phototoon service sections or a plurality of phototoon composite images may be generated and service provided for one target video.

The artificial intelligence device 10 may sense an event, for example, a user's request through a remote control device, skip each phototoon service section accordingly, and thereby allow the user to consume the target video.

Alternatively, when the phototoon service is activated or requested through event detection, the artificial intelligence device 10 may distinguish the sections (areas) within the target video for which the phototoon service is available, and present each distinguished phototoon service section in an identifiable manner for the user to select.

If there is a plurality of data consisting of phototoon services for the target video, the artificial intelligence device 10 can list them and provide them for selection, and output the selected phototoon service data.

In the present disclosure, phototoon composite data can be generated in units of desired sections. Here, the ‘desired section’ may represent, for example, a frame, a scene, or a sequence unit composed of a plurality of scenes. For example, even if a phototoon service is requested for a sequence unit, the artificial intelligence device 10 may generate phototoon composite data only for some scene(s) (or main scenes), not for all scenes constituting the sequence. However, it is not necessarily limited to the above contents.

Below, in relation to the Phototoon service, STT conversion technology based on voice recognition technology can be used.

Voice recognition technology may be processed by an STT engine (and NLP engine) provided in the artificial intelligence device 10, but is not necessarily limited to this. For example, voice recognition technology may be processed through the STT server 20 and NLP server 30 in the voice service server 200 and transmitted to the artificial intelligence device 10.

The artificial intelligence device 10 may create and provide a phototoon service menu item on its dashboard or menu so that users can easily enter and use the phototoon service, or may provide an application dedicated to the phototoon service that can be downloaded and installed for use. Alternatively, when an event request such as selection or playback of a video of a preset length or longer is received, the artificial intelligence device 10 may provide an icon or an OSD (On Screen Display) message as a guide for using the phototoon service.

The voice service server 200 can provide a phototoon service platform and can support or guide the use of the phototoon service for target video data in the form of a web service or web app through the artificial intelligence device 10.

FIG. 7 is a block diagram of the processor 620 of FIG. 6.

First, referring to FIG. 6, a voice service system for providing a phototoon service based on a voice recognition function may be configured to include an artificial intelligence device 10. Depending on the embodiment, the voice service server 200 may replace all or part of the functions related to the phototoon service of the artificial intelligence device 10.

The artificial intelligence device 10 may include an output unit 150 and a processing unit 600 that output phototoon service data and/or video data including phototoon service data.

The processing unit 600 may include a memory 610 and a processor 620.

The processor 620 controls the overall functions of the processing unit 600 and can perform operations to provide the phototoon service.

Referring to FIG. 7, the processor 620 may include a data reception unit 710, a detection unit 720, a voice recognition engine 730, a synthesis unit 740, and a control unit 750 to provide the phototoon service. Here, at least one of the components constituting the processor 620 may be implemented as a plurality of modules, unlike what is shown. Depending on the embodiment, the processor 620 may further include at least one component not shown in FIG. 7.

The data receiver 710 may receive video data, identify a phototoon service request section (or a phototoon service-capable candidate section), and process the identified phototoon service-capable candidate sections by dividing them into predetermined units. The predetermined unit may be the above-mentioned frame unit, scene unit, sequence unit, etc. This distinction can be made only for the target video to which the phototoon service is applied or the phototoon service request section of the target video.

The detection unit 720 can detect phototoon service-related information for a predetermined unit within the target video data. The information detected in this way may include at least one of scene/sequence change information, main scene information, facial feature information, face-based representative scene information, and voice information.

The detection unit 720 may include a preprocessing module, a learning module, etc., and can automatically detect at least one of the above-described information by learning the generated artificial intelligence model related to the phototoon service.

The voice recognition engine 730 includes an STT engine and can convert voice information corresponding to image information detected through the detection unit 720 into text information. As described above, depending on the embodiment, the function of the voice recognition engine 730 may be performed by the STT server 20 in the voice service server 200, in which case the voice recognition engine 730 in FIG. 7 may be disabled or excluded from the configuration.

The synthesis unit 740 can process and synthesize the image information detected by the detection unit 720 and the text information converted through the voice recognition engine 730 so that they are in sync.

The control unit 750 may control the overall operation and functions of the processor 620.

The control unit 750 can control each of the above components to provide the phototoon service according to the present disclosure to the target video.

Meanwhile, the processor 620 may have the same configuration as the processor 180 of FIG. 2, but may also have a separate configuration.

In the present disclosure, although it is described as the artificial intelligence device 10 for convenience of explanation, it may be replaced by or operate together with the voice service server 200 depending on the context.

FIGS. 8 to 11 are flowcharts illustrating a method of providing a voice recognition-based phototoon service according to an embodiment of the present disclosure.

Figure 8 is described from the perspective of the processor 620 for convenience of explanation, but is not limited thereto.

Referring to FIG. 8, the processor 620 may output video data through the output unit 150 (S101).

The processor 620 may detect an event (S103).

Events can represent various inputs, actions, etc. related to the Phototoon service. For example, the event may represent the reception of a user's phototoon service request signal through a remote control device (not shown). The remote control device may include a remote control, a mobile device such as a smartphone or tablet PC installed with an application for data communication with the artificial intelligence device 10, an artificial intelligence speaker, etc.

This event may or may not occur while video data is being watched, for example in step S101. In the latter case, as described above, the event may be triggered through a menu item on the home menu or through voice input in any screen state (e.g., a state in which no video is playing). In this sense, step S101 may not be essential. In that case, when an event is detected, the artificial intelligence device 10 can provide a video list and provide a phototoon service for the selected video. These video lists may also include broadcast programs.

The processor 620 may extract image data in a predetermined unit (S105).

As described above, the predetermined unit may be any one of units such as a frame, scene, or sequence. Depending on the embodiment, there may be a plurality of predetermined units within one video data, and in this case, each unit may be different from each other. For example, one may be a scene unit and the other may be a sequence unit.

According to another embodiment, the predetermined unit may represent, for example, a playback section arbitrarily set by the user.

According to another embodiment, the predetermined unit may represent, for example, a section in which an object selected by the user is output. At this time, the object may be a concept including people, objects, etc. Meanwhile, if there are a plurality of people as objects in the video data, only one person may be selected, and only the scene or section in which the selected person appears may be included in the predetermined unit.

According to another embodiment, the predetermined unit may be determined based on a theme, attribute, etc., rather than a physical object. For example, when the artificial intelligence device 10 receives a request from the user to provide a phototoon service for a video containing a cooking scene and other scenes, the artificial intelligence device 10 may set cooking, that is, a theme, as the predetermined unit and offer it for selection. Accordingly, only the sections related to cooking within the playback section of the target video can be extracted and used in the phototoon service.

Meanwhile, the artificial intelligence device 10 extracts information for the phototoon service in predetermined units within the requested video playback section, but the requested video playback section does not necessarily need to be a continuous playback section.

When multiple videos are selected together as target videos for the phototoon service, the artificial intelligence device 10 may generate one phototoon service data based on a preset unit for each video. For example, if the theme of ‘cooking’ is set as a unit and multiple Phototoon service target videos are selected, a section related to cooking can be extracted from each target video to automatically generate one Phototoon service data.

Meanwhile, the artificial intelligence device 10 can provide a list of currently playable videos regardless of the Phototoon service, and may also provide identification information about whether or not the Phototoon service is available for each video on the provided video list.

The processor 620 may extract voice data corresponding to the extracted predetermined unit of image data (S107).

The processor 620 may STT process the extracted corresponding voice data (S109).

The processor 620 may synthesize the STT-converted voice data, that is, the text data, with the extracted image data, aligning the two so that they are in sync (S111).

The processor 620 may provide a phototoon service based on a synthetic image (S113).
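Steps S105 to S113 can be summarized as a small pipeline. This is a sketch under stated assumptions: the `Unit` and `CompositeImage` types, the string placeholders standing in for image and voice data, and the injected `extract_voice` and `speech_to_text` callables are all hypothetical and only show how the steps chain together.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    start: float  # start time of the predetermined unit, in seconds
    end: float    # end time of the predetermined unit, in seconds
    image: str    # placeholder for the extracted image data (S105)


@dataclass
class CompositeImage:
    image: str
    text: str
    start: float
    end: float


def build_phototoon(units, extract_voice, speech_to_text):
    composites = []
    for unit in units:
        voice = extract_voice(unit.start, unit.end)  # S107: matching voice data
        text = speech_to_text(voice)                 # S109: STT processing
        # S111: keep the unit's time range so text and image stay in sync.
        composites.append(CompositeImage(unit.image, text, unit.start, unit.end))
    return composites                                # S113: data to be provided
```

Passing the voice extraction and STT steps as callables keeps the sketch neutral about whether they run on the device or on a server.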

Next, with reference to FIG. 9 and (a) of FIG. 12, a method of providing a phototoon service based on scene change will be described.

When a phototoon service is requested for a video, the processor 620 detects a change in a predetermined unit within the video. For example, in FIG. 9, the processor 620 can detect (or sense) whether there is a scene change (S201).

Scene change detection may refer to either determining whether a scene change section exists in the target video or detecting data corresponding to the scene change section.

Referring to (a) of FIG. 12, the predetermined unit can be automatically set based on the scene change section.

Referring to (a) of FIG. 12, it can be seen that one scene starts at the first time point 1210 and another scene starts at the second time point 1220.

In Figures 9 and 12(a), the scene change may be a section corresponding to the predetermined unit of Figure 8 described above.
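This disclosure does not specify how a scene change is detected. One common approach, shown here purely as an assumption, is to compare consecutive frames and mark a new scene wherever the mean absolute pixel difference exceeds a threshold. Frames are modeled as equal-length lists of grayscale values, and all names are hypothetical.

```python
def mean_abs_diff(frame_a, frame_b):
    # Average absolute difference between corresponding pixel values.
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def scene_start_indices(frames, threshold):
    # Returns the indices of frames at which a new scene is assumed to start.
    if not frames:
        return []
    starts = [0]  # the first frame always opens a scene
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold:
            starts.append(i)
    return starts
```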

The processor 620 may detect a main scene (or important scene) for each partial clip (S203).

The processor 620 may detect facial features in key scenes of each detected partial clip (S205).

The processor 620 may detect a representative scene based on the facial features of the main scene of each partial clip detected in step S205 (S207).

The processor 620 may extract voice data of a section corresponding to the representative scene detected in step S207 (S209).

The processor 620 may process STT conversion on the voice data extracted in step S209 (S211).

The processor 620 may synthesize the representative scene detected in step S207 and the STT-processed data in step S211 (S213).

The processor 620 can configure and provide a phototoon service using the synthesized data, that is, the phototoon composite data.

The method of providing phototoon services follows pre-set conditions, but can be changed arbitrarily.

Next, providing a phototoon service based on the audio output section will be described with reference to FIG. 10 and (b) of FIG. 12.

The processor 620 may detect voice in the video playback section (S301).

If voice is detected within the video playback section through step S301, the processor 620 may extract the section where voice is detected, that is, the voice section (S303).

The above-described step S301 may be omitted and integrated into step S303.

Meanwhile, referring to (b) of FIG. 12, a predetermined unit can be automatically set based on voice section extraction.

Referring to (b) of FIG. 12, audio may be output at a third time point 1230 and again at a fourth time point 1240. Therefore, only the scenes at the times when voice is output can be extracted.
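Voice section extraction (S301 to S303) can be sketched with a simple energy threshold. The detection method is not specified in this disclosure; the per-frame energy input, the threshold rule, and the function name below are assumptions chosen for illustration.

```python
def voice_sections(frame_energies, frame_duration, threshold):
    # Returns (start_seconds, end_seconds) pairs for runs of consecutive
    # audio frames whose energy exceeds the threshold.
    sections = []
    start = None
    for i, energy in enumerate(frame_energies):
        if energy > threshold and start is None:
            start = i  # a voice section begins
        elif energy <= threshold and start is not None:
            sections.append((start * frame_duration, i * frame_duration))
            start = None  # the voice section ended
    if start is not None:
        # The audio ended while voice was still being detected.
        sections.append((start * frame_duration,
                         len(frame_energies) * frame_duration))
    return sections
```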

If the processor 620 extracts a voice section in step S303, it can perform STT conversion on the voice data of the corresponding section (S305).

The processor 620 may detect face data on the frame in the section where voice data is extracted in step S303 (S307).

The processor 620 may extract facial features from the facial data detected in step S307 (S309).

The processor 620 may detect a representative scene based on the facial features extracted in step S309 (S311).

The processor 620 may combine the STT converted data in step S305 and the representative scene detected in step S311 into one image (S313).

Figures 11 and 14 describe, for example, a method of compositing images when providing a phototoon service.

Referring to FIGS. 11 and 14 , when compositing images, the processor 620 may determine whether the face 1410 is detected, as shown in (a) of FIG. 14 (S401).

If a face is detected as a result of determination in step S401, the processor 620 may determine whether the face size exceeds the threshold (S403).

If the processor 620 determines that the face size exceeds the threshold as a result of determination in step S403, the processor 620 may recognize the face direction as shown in (b) of FIG. 14 (S405).

If the face direction is recognized in step S405, the processor 620 can next recognize the mouth position as shown in (c) of FIG. 14 (S407).

The processor 620 may determine the location where the STT converted text information is output, that is, the location of the speech bubble 1430, based on the face direction recognized in step S405 and the mouth position recognized in step S407 (S409).

When the position of the speech bubble is determined through step S409, the processor 620 may combine the speech bubble data and the image frame into one image so that the speech bubbles 1310 and 1430 are output at the corresponding position, as shown in (a) of FIG. 13 and (c) of FIG. 14 (S411).

Meanwhile, if the processor 620 determines in step S401 that no face is detected in the scene (or frame), or determines in step S403 that the face size is less than the threshold, the processor 620 may combine the STT-converted text with the corresponding scene or frame so that it is output as subtitles in a predetermined area 1320 of the image, as shown in (b) of FIG. 13 (S413).
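The decision flow of steps S401 to S413 can be sketched as follows. The `Face` type, the use of width times height as the face size, the "left"/"right" encoding of face direction, and the fixed pixel offset for the speech bubble are all assumptions; the source only states that the position is based on the face direction and the mouth position.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Face:
    width: int
    height: int
    direction: str          # "left" or "right" (assumed encoding)
    mouth: Tuple[int, int]  # (x, y) pixel position of the mouth


def place_text(face: Optional[Face], min_face_area: int, offset: int = 40):
    # S401/S403: fall back to subtitles when no face is detected or the
    # face does not exceed the size threshold (equality counts as too small).
    if face is None or face.width * face.height <= min_face_area:
        return ("subtitle", None)  # S413: output in a predetermined area
    # S405-S409: put the bubble beside the mouth, on the side the face
    # is turned toward, and slightly above it (y grows downward).
    mouth_x, mouth_y = face.mouth
    dx = offset if face.direction == "right" else -offset
    return ("speech_bubble", (mouth_x + dx, mouth_y - offset))
```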

Figures 15, 16a, and 16b are diagrams to explain a method of providing a phototoon service using voice recognition technology according to an embodiment of the present disclosure.

Figures 15 (a) to (d) are diagrams illustrating a method of summarizing video data using, for example, a phototoon service.

Here, the summary may mean providing only the main composite images among the composite images, each of which combines voice-recognition-processed text information and the corresponding image data into one image, for one video data or for a predetermined unit within one video data that is the target of the phototoon service.

For convenience, it is explained that (a) to (d) of Figure 15 are provided simultaneously through the output unit 150 of the artificial intelligence device 10.

Depending on the embodiment, (a) to (d) of FIG. 15 may represent composite images in which the STT-converted text of the voice data is synthesized onto a representative scene image of each scene unit within one video. At this time, an audio waveform may be output at the bottom of each representative scene image, and position information of the current audio output may also be provided.

When one of the composite images 1510 to 1540 in (a) to (d) of FIG. 15 is selected, the artificial intelligence device 10 may unfold and provide, in a slide manner, the composite images of the scene associated with (or mapped to) the selected composite image. Alternatively, a video (consisting only of composite images in video form) may be played in that area.

Meanwhile, when one of the voice waveforms 1515 to 1545 in (a) to (d) of FIG. 15 is selected, the artificial intelligence device 10 may output the composite image corresponding to the selected voice position and, depending on the selection, play or sequentially provide the composite images that follow that position.

In a similar manner, when one of the voice waveforms 1515 to 1545 in (a) to (d) of FIG. 15 is dragged and dropped onto one of the image areas 1510 to 1540, the composite images of the scene associated with the corresponding composite image may be played and provided sequentially.

On the screen of (a) to (d) of FIG. 15, the artificial intelligence device 10 may simultaneously play at least two composite images (e.g., those shown in (a) and (c) of FIG. 15) according to the user's selection. At this time, since text information is provided in the composite image itself, the voice data can be muted.

According to another embodiment, when at least one of the images 1510 to 1540 or a voice waveform is long-clicked, the artificial intelligence device 10 may provide a guide or control for changing the playback speed or size of the composite image.

Referring to FIG. 16A, when a user requests a phototoon service for a target video, for example, video data about fitness, the artificial intelligence device 10 may divide the entire video section into a plurality of groups 1610, 1620, and 1630 (e.g., upper body fitness, lower body fitness, etc.) according to a predetermined unit, for example, a fitness routine, and generate composite images for each group.

If a phototoon summary service is separately requested by the user, the artificial intelligence device 10 may provide summary data of the fitness video by providing composite images in groups, as shown in FIG. 16A.

Referring to FIG. 16B, the artificial intelligence device 10 may also provide a summary service for dramas and movies according to the phototoon service requested by the user. For example, in the case of a drama series, in response to the user's phototoon service request, the artificial intelligence device 10 may extract composite image candidate images based on an actor, such as the main character in each series, or on scene properties (e.g., action scene, driving scene, love scene), extract the corresponding audio data, and, after STT processing, synthesize each candidate image and a speech bubble containing the converted text into one image. The synthesized images may then be provided sequentially according to the playback order or in a slide manner.

In this disclosure, the phototoon summary service may be provided according to a phototoon service provision request or a separate phototoon summary request.

In the above, a group may be defined differently depending on a category, an attribute, and the like.
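The grouping and summarizing of composite images can be illustrated with a minimal sketch. It assumes that each composite image is represented as a dictionary carrying a label such as "category"; this representation, the key name, and the rule of keeping the first images of each group are assumptions made only for illustration, since the disclosure does not fix a grouping criterion.

```python
from collections import defaultdict


def group_composites(composites, key="category"):
    """Group composite images by the value stored under key (assumed label)."""
    groups = defaultdict(list)
    for composite in composites:
        groups[composite[key]].append(composite)
    return dict(groups)


def summarize(groups, per_group=1):
    """Keep only the first per_group composite images of every group."""
    return {name: items[:per_group] for name, items in groups.items()}
```

Changing the key (for example, from a fitness routine to an actor or a scene property) changes how the groups are defined without changing the rest of the processing.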

In FIGS. 13 to 16, when a specific object in a composite image is selected, the artificial intelligence device 10 may operate as follows.

The artificial intelligence device 10 may provide information related to the object or a list of other composite images related to the object.

The artificial intelligence device 10 may re-perform the synthesis processing for the phototoon service on the target video data based on the selected object and provide the result.

For example, assume that the artificial intelligence device 10 has performed the synthesis processing for the phototoon service on the target video with respect to user A, the main character, and has provided composite images. If user B, a supporting character, appears in the provided composite images and the user selects user B, the artificial intelligence device 10 may collect and output only the composite images for user B from among the existing composite images, or may re-perform the synthesis processing for the phototoon service on the target video based on user B and provide the resulting composite images.
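The two options in this example can be sketched as follows. The sketch assumes that each composite image records the set of objects (persons) appearing in it under an "objects" key, and that a hypothetical callback named resynthesize re-runs the synthesis processing for a given object. Trying the existing composite images first and falling back to re-synthesis only when none match is an ordering chosen for illustration; the disclosure presents the two options simply as alternatives.

```python
def composites_for_object(composites, selected, resynthesize=None):
    """Return the composite images associated with the selected object.

    composites: list of dicts with an "objects" set (assumed layout).
    resynthesize: optional hypothetical callback that regenerates composite
    images for the selected object when no existing image shows it.
    """
    matches = [c for c in composites if selected in c["objects"]]
    if matches or resynthesize is None:
        return matches
    return resynthesize(selected)
```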

The phototoon service according to the present disclosure can divide the target video into a section where the face is exposed and a section where the face is not exposed, and perform a compositing process only for the section where the face is exposed.

Alternatively, the phototoon service according to the present disclosure may perform a compositing process for each section, and construct and output a summary phototoon for each section.
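Dividing the target video into sections by face exposure can be sketched as follows. The sketch assumes that a face detector (not shown) has already produced one boolean flag per sampled frame; the function names and the fps parameter are hypothetical.

```python
from itertools import groupby


def split_by_face(face_flags, fps=1.0):
    """Split per-frame face flags into (start_sec, end_sec, face_visible) sections."""
    sections = []
    index = 0
    for visible, run in groupby(face_flags):
        length = len(list(run))
        sections.append((index / fps, (index + length) / fps, visible))
        index += length
    return sections


def face_sections(face_flags, fps=1.0):
    """Only the sections in which a face is exposed, as (start_sec, end_sec)."""
    return [(start, end) for start, end, visible in split_by_face(face_flags, fps) if visible]
```

The first function supports processing every section separately, while the second supports processing only the sections in which a face is exposed.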

According to the present disclosure, a composite image in the phototoon service is created by combining a still image and text data. In this case, the still image and the text data may be data for sections that are in sync with each other. However, according to another embodiment, in the case of content, that is, a video, in which the exposure of a person is relatively important, the composite image of the phototoon service is synthesized based on an image in which the person is exposed. In this case, the audio data of an image (scene) in which only the person's voice is output without the person being exposed may also be STT-converted and combined with an image of that person to create a composite image.
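One possible way to choose the still image for an utterance is sketched below. It assumes that the speaker of each utterance has been identified and that each candidate frame lists the persons visible in it; both are assumptions not specified in the disclosure, and the nearest-in-time rule is only one illustrative choice. A frame inside the utterance (the in-sync case) is always at least as close to the midpoint as any frame outside it, so it is preferred automatically.

```python
def pick_frame(utterance, frames):
    """Pick the frame showing the speaker that is closest in time to the utterance.

    utterance: dict with "speaker", "start", "end" (seconds) (assumed layout).
    frames: list of dicts with "time" (seconds) and "persons" (set of names).
    Returns None when no frame shows the speaker.
    """
    candidates = [f for f in frames if utterance["speaker"] in f["persons"]]
    if not candidates:
        return None
    midpoint = (utterance["start"] + utterance["end"]) / 2
    return min(candidates, key=lambda f: abs(f["time"] - midpoint))
```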

The number of composite images that make up the phototoon service may be determined in proportion to the size or playback time of the target video. For example, if 10 composite images are generated for a 10-minute target video, 30 composite images may be generated for a 30-minute target video. However, even in this case, if the playback time of the target video exceeds a certain level, the number of composite images may be limited to a predetermined maximum.
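The proportional rule with an upper limit can be written as a short sketch. The rate of one composite image per minute follows the example above; the maximum of 60 is a hypothetical value chosen only for illustration, since the disclosure does not state the limit.

```python
def composite_image_count(duration_min, images_per_min=1.0, max_images=60):
    """Number of composite images for a video of duration_min minutes.

    The count grows in proportion to the playback time and is capped at
    max_images (a hypothetical limit, not taken from the disclosure).
    """
    if duration_min <= 0:
        return 0
    return min(round(duration_min * images_per_min), max_images)
```

With these assumed parameters, a 10-minute video yields 10 composite images, a 30-minute video yields 30, and a 180-minute video is limited to 60.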

According to at least one of the various embodiments of the present disclosure described above, the phototoon service is provided by synthesizing voice-recognition-based text conversion data with video data, thereby expanding the usability of the system and improving or maximizing user satisfaction. However, the present disclosure is not limited thereto. Conversely, for data consisting of still image data and text, the phototoon service may be provided in the same way as for video data by converting the text into speech based on speech recognition. The principle can be readily inferred from the above-described embodiments.

Descriptions of methods, sequences, and the like in the present disclosure are not necessarily bound by the order shown in the drawings; depending on the embodiment, the order may be changed or operations may be performed simultaneously. Additionally, not all operations or processes shown in the drawings are essential, and some operations or processes may be omitted or added depending on the embodiment.

As described above, according to at least one of the various embodiments of the present disclosure, a phototoon service can be provided for a desired portion of video data of a predetermined length, and multimedia functions can be provided in conjunction with various applications.

Even if not specifically mentioned, at least some of the operations disclosed in this disclosure may be performed simultaneously or in an order different from that described above, and some operations may be omitted or added.

According to an embodiment of the present invention, the above-described method can be implemented as processor-readable code on a program-recorded medium. Examples of media that the processor can read include ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices.

The artificial intelligence device described above is not limited to the configurations and methods of the above-described embodiments; all or part of each embodiment may be selectively combined so that various modifications can be made.

According to the artificial intelligence device and its operating method according to the present disclosure, a phototoon service using voice recognition technology is provided for predetermined units of data constituting video data of various lengths, so that information in the video data, summarized as a phototoon, can be recognized simply and easily. The present disclosure therefore has industrial applicability, as it can maximize user satisfaction.

Claims

A method of operating an artificial intelligence device, the method comprising:

detecting an event;

extracting at least one piece of image data constituting video data according to the event;

extracting voice data corresponding to the image data and performing STT processing on the voice data;

synthesizing the STT-processed data and the image data into one image; and

outputting the synthesized image.

The method of claim 1, wherein the event includes receiving a phototoon service request signal.

The method of claim 2, wherein the at least one piece of image data is data corresponding to any one of a frame, a scene, and a sequence unit that is a set of a plurality of scenes.

The method of claim 2, wherein the at least one piece of image data is determined based on an object in the video data.

The method of claim 1, further comprising:

detecting a face from the at least one piece of image data;

when the size of the detected face exceeds a threshold, recognizing a direction of the face;

recognizing a position of the mouth of the face;

determining a position of a speech bubble that is to contain the STT-processed data according to the direction of the face and the position of the mouth recognized for the detected face; and

synthesizing with the image data such that the speech bubble containing the STT-processed data is located at the determined position of the speech bubble.

The method of claim 5, further comprising, when no face is detected from the at least one piece of image data, determining a position such that the speech bubble is output in one area of a screen and synthesizing the speech bubble with the image data.

The method of claim 1, wherein the at least one piece of image data corresponds to a scene change section in the video data or to a section between audio output sections.

The method of claim 1, wherein, when there are a plurality of composite images for the video data, the plurality of composite images are grouped and summarized according to predefined criteria, and only some of the composite images are output.

An artificial intelligence device comprising:

a display that outputs video data; and

a processor that controls the display,

wherein the processor detects an event, extracts at least one piece of image data constituting the video data according to the event, extracts voice data corresponding to the image data and performs STT processing on the voice data, synthesizes the STT-processed data and the image data into one image, and outputs the synthesized image.

The artificial intelligence device of claim 9, wherein the event includes receiving a phototoon service request signal, and

the at least one piece of image data is data corresponding to any one of a frame, a scene, and a sequence unit that is a set of a plurality of scenes.

The artificial intelligence device of claim 10, wherein the processor determines the at least one piece of image data based on an object in the video data.

The artificial intelligence device of claim 9, wherein the processor detects a face from the at least one piece of image data, recognizes a direction of the face and a position of the mouth when the size of the detected face exceeds a threshold, determines a position of a speech bubble that is to contain the STT-processed data according to the direction of the face and the position of the mouth recognized for the detected face, and synthesizes with the image data such that the speech bubble containing the STT-processed data is located at the determined position of the speech bubble.

The artificial intelligence device of claim 12, wherein, when no face is detected from the at least one piece of image data, the processor determines a position such that the speech bubble is output in one area of a screen and synthesizes the speech bubble with the image data.

The artificial intelligence device of claim 9, wherein the at least one piece of image data corresponds to a scene change section in the video data or to a section between audio output sections.

The artificial intelligence device of claim 9, wherein, when there are a plurality of composite images for the video data, the processor groups and summarizes the plurality of composite images according to predefined criteria and outputs only some of the composite images.