WO2021157957A1 - Method and apparatus for providing audio signals - Google Patents

Method and apparatus for providing audio signals

Info

Publication number
WO2021157957A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
message
user message
image
audio signal
Prior art date
Application number
PCT/KR2021/001170
Other languages
English (en)
Korean (ko)
Inventor
정크직미카엘
포들라스카카타르지나
루카시아크보제나
베크사카타르지나
부주노스키파웰
Original Assignee
삼성전자 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 삼성전자 주식회사
Publication of WO2021157957A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 51/00 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L 51/07 - User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail, characterised by the inclusion of specific contents
    • H04L 51/10 - Multimedia information
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 1/00 - Substation equipment, e.g. for use by subscribers
    • H04M 1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724 - User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403 - User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M 1/00 - Substation equipment, e.g. for use by subscribers
    • H04M 1/72 - Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/725 - Cordless telephones

Definitions

  • the present disclosure relates to the field of audio signal processing. More specifically, the present disclosure relates to an apparatus and method for providing an audio signal suited to a user message, together with the user message.
  • Emoticons inserted into messages have evolved from simple symbols represented by a combination of punctuation marks to animated pictures representing complex situations.
  • a variety of emoticons facilitates communication and enriches simple, monotonous messages. By associating an emoticon with an entire text segment, a user can replace the text segment with a single click of an emoticon, or express an opinion on the text segment. Emoticons also make it easy to convey meanings, contexts, and emotions that are difficult to express in text.
  • a data processing method and apparatus for generating a message including an audio signal according to an embodiment are provided.
  • a program for executing a data processing method for generating a message including an audio signal in a computer according to an embodiment of the present disclosure and a computer-readable recording medium in which the program is recorded are disclosed.
  • according to an embodiment, a data processing method includes obtaining a user message including text, an image, a video, a voice, or a combination thereof; determining an input vector to be input to a DNN from the user message; obtaining an output vector of the user message from the DNN by inputting the input vector to the DNN; determining a similarity score between the output vector of the user message and output vectors of a plurality of audio signals in a database by comparing the output vector of the user message with an output vector of an audio signal stored in the database; and determining, based on the similarity score, an audio signal for the user message from among the plurality of audio signals in the database.
  • according to an embodiment, a data processing apparatus is provided that includes a memory device storing instructions implementing each step of the data processing method provided in the present disclosure, and a processor that executes the instructions stored in the memory device.
  • a data processing method and apparatus for providing an audio signal corresponding to the content of a message together with the message are provided. By conveying the message together with the audio signal, the meaning of the message and the feelings of the message creator can be effectively conveyed to the message recipient. In addition, because the audio signal is transmitted, the message recipient can hear the audio signal first and infer the content of the message before reading it.
  • FIG. 1 shows an embodiment of a method of matching a user message and an audio signal according to text vectorization.
  • FIG. 2 illustrates an embodiment of a method of matching a user message including a user's voice message with an audio signal.
  • FIG. 3 illustrates an embodiment of a method of matching a user message including a user's voice message with an audio signal.
  • FIG. 4 illustrates an embodiment of an audio signal input to a database.
  • FIG. 5 illustrates an embodiment to which a method of matching a user message and an audio signal during a phone call is applied.
  • FIG. 6 illustrates an embodiment of a method for selecting one of selectable candidate audio signals displayed on a mobile device.
  • FIG. 7 illustrates an embodiment of a method for processing an audio signal attached to an originator message in a receiver device.
  • FIG. 8 illustrates an embodiment of an IoT home appliance in which an audio signal of a user message is reproduced.
  • FIG. 9 illustrates an embodiment of a data processing apparatus performing a method of matching a user message with an audio signal.
  • FIG. 10 illustrates an embodiment of a data processing method for matching a user message with an audio signal.
  • when a component is referred to as being "connected" or "coupled" to another component, the component may be directly connected or directly coupled to the other component, but unless there is a description to the contrary, it should be understood that it may also be connected or coupled through another element in between.
  • for each of the components expressed as a '~ part (unit)', 'module', or the like, two or more components may be combined into one component, or one component may be divided into two or more components according to more subdivided functions.
  • each of the components described below may additionally perform some or all of the functions of other components in addition to its own main functions, and some of the main functions of each component may be performed exclusively by another component.
  • FIG. 1 shows an embodiment of a method of matching a user message and an audio signal according to text vectorization.
  • an audio signal processing method of automatically analyzing a user message and selecting an audio signal corresponding to the user message from a database according to the analysis result is described.
  • the user message is analyzed according to steps 102 to 108.
  • the correlation between the audio signals and the content of the database is analyzed according to steps 110 to 114.
  • in step 116, the user message and the audio signals of the database are compared according to the analysis result of the user message and the correlation between the audio signals and the content of the database. According to the comparison result, the audio signal most suitable for the user message is selected.
  • in step 102, the image of the user message is analyzed.
  • the multi-mode message means a message including various types of content.
  • a multi-mode message may contain two or more of text, image, video, and audio content.
  • the meaning, context, emotion, etc. of the user message may be inferred from the image or video of the user message using a DNN model trained on various contents and audio signals. According to the inference, analysis information including the meaning, context, emotion, etc. of the user message is output.
  • the analysis information may include various output values for the image or video of the user message. Step 102 may be selectively applied when the user message includes an image or video, or when a comment or description of the image or video is attached to the sound item of the user message.
  • in step 104, when the user message is in voice form or is a multi-mode message including voice content, the user message may be converted into text form according to automatic speech recognition (ASR). According to the automatic speech recognition, analysis information including a text-form user message is output. Step 104 may be selectively applied when the user message is in voice form.
  • in step 106, the analysis information of the user message output from steps 102 and 104 is input, and natural language processing (NLP) preprocessing is performed on it.
  • the NLP preprocessing may include text tokenization, text standardization, and/or text summarization.
  • various processes may be selectively used. For example, text correction in consideration of the basic form of a word and synonyms of a word, text correction using a word/phrase dictionary, and text correction using complex statistical/neural models may be used for text standardization.
  • NLP preprocessing may be performed for each predetermined video frame unit. For example, NLP preprocessing may be performed in units of 4 frames, units of 8 frames, or units of 16 frames. Since NLP preprocessing is performed for each predetermined video frame unit, analysis information of a user message can be efficiently summarized.
  • the NLP preprocessing of step 106 may optionally be applied.
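  • as a rough illustration of the NLP preprocessing described above, the Python sketch below tokenizes, standardizes, and summarizes message text. The standardization dictionary, the token limit, and the function name are assumptions for illustration and are not part of the disclosure.

```python
import re

# Hypothetical synonym/base-form dictionary used for text standardization;
# a real system could instead use lemmatization or a statistical/neural model.
STANDARD_FORMS = {"pics": "picture", "pix": "picture", "u": "you"}

def nlp_preprocess(text, max_tokens=32):
    """Tokenize, standardize, and summarize a message text (illustrative sketch)."""
    # Text tokenization: lowercase and split on non-alphanumeric characters.
    tokens = [t for t in re.split(r"[^0-9a-zA-Z']+", text.lower()) if t]
    # Text standardization: map each token to its base/synonym form if known.
    tokens = [STANDARD_FORMS.get(t, t) for t in tokens]
    # Text summarization: here simply truncate to a predetermined length;
    # a deployed system could use an extractive or abstractive summarizer.
    return tokens[:max_tokens]

print(nlp_preprocess("You had a wonderful walk on a sunny beach!"))
```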
  • step 108 the user message and/or the analysis information of the user message is processed.
  • the first user message "Hello?” and the response message "You had a wonderful walk on a sunny beach!” can be vectorized.
  • the vectorized initial user message and response message can be used as input to the neural network.
  • analysis information generated according to the analysis of image, video, and voice contents of the user message in steps 102 and 104 may be processed. Table 1 below describes parts of an input vector input to the DNN according to the type of user message.
  • Table 1
    User message type | Sentence part | Image caption part | Context sentence part | Context image caption part
    Text message in response to a previous text message | sentence text | sentence text | context sentence text | context sentence text
    Multi-mode message in response to a previous text message | sentence text | image caption text | context sentence text | context sentence text
    Text message in response to a previous image message | sentence text | sentence text | context image caption text | context image caption text
    Image message in response to a previous image message | image caption text | image caption text | context image caption text | context image caption text
    Text message in response to a previous multi-mode message | sentence text | sentence text | context sentence text | context image caption text
    Image message in response to a previous multi-mode message | image caption text | image caption text | context sentence text | context image caption text
    Multi-mode message in response to a previous multi-mode message | sentence text | image caption text | context sentence text | context image caption text
  • an input vector input to the DNN may include four parts. And four parts of the input vector are input to the DNN, thereby generating an output vector of a fixed length.
  • the input vector may include a sentence part, an image caption part, a context sentence part, and a context image caption part.
  • Components of the parts included in the input vector may vary depending on the type of the user message.
  • the component of the sentence part is basically determined by the sentence text derived from the sentence of the user message; however, if the user message consists only of an image, the image caption text added to the image of the user message may be determined as the component of the sentence part.
  • the component of the image caption part is basically determined by the image caption text added to the image; however, if the user message does not contain an image, the sentence text derived from the sentence of the user message may be determined as the component of the image caption part.
  • components of the context sentence part and the context image caption part may be determined from a previous message received by the user before the user message.
  • the user message may be generated by the user depending on the context of a previous message received by the user before generating the user message. Therefore, by analyzing the context of the previous message together with the user message, the meaning of the user message can be inferred more precisely.
  • a component of the context sentence part may be determined by the context sentence text of the previous message.
  • here, the context sentence text means the sentence text of the previous message.
  • from the context sentence text, the context of the sentence text of the user message may be derived. If the previous message consists only of an image, the context image caption text added to the image of the previous message may be determined as a component of the context sentence part. If there is no previous message, the sentence text or image caption text of the user message may be determined as a component of the context sentence part according to the type of the user message. For example, when there is no previous message and the user message includes a sentence, the sentence text derived from the sentence of the user message may be determined as a component of the context sentence part. If there is no previous message and the user message consists only of an image, the image caption text added to the image of the user message may be determined as a component of the context sentence part.
  • a component of the context image caption part may be determined by the context image caption text of the previous message.
  • the context image caption text means the image caption text of the previous message.
  • from the context image caption text, the context of the image caption text of the user message may be derived. If the previous message does not include an image, the context sentence text of the previous message may be determined as a component of the context image caption part. If there is no previous message, the sentence text or image caption text of the user message may be determined as a component of the context image caption part according to the type of the user message. For example, when there is no previous message and the user message includes a sentence, the sentence text derived from the sentence of the user message may be determined as a component of the context image caption part. If there is no previous message and the user message consists only of an image, the image caption text added to the image of the user message may be determined as a component of the context image caption part.
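  • the fallback rules above can be summarized in code. The sketch below is one assumed way to assemble the four text parts; the dictionary-based message format and field names are hypothetical, not part of the disclosure.

```python
from typing import Optional

def build_input_parts(user_msg: dict, prev_msg: Optional[dict]) -> dict:
    """Assemble the four text parts of the DNN input vector (illustrative sketch).

    user_msg and prev_msg are assumed to be dicts with optional
    'sentence_text' and 'image_caption_text' keys (hypothetical format).
    """
    sent = user_msg.get("sentence_text")
    cap = user_msg.get("image_caption_text")

    # Sentence part: sentence text, falling back to the image caption text
    # when the user message consists only of an image.
    sentence_part = sent or cap
    # Image caption part: image caption text, falling back to the sentence text
    # when the user message contains no image.
    image_caption_part = cap or sent

    if prev_msg is not None:
        prev_sent = prev_msg.get("sentence_text")
        prev_cap = prev_msg.get("image_caption_text")
        # Context parts come from the previous message, with the same fallbacks.
        context_sentence_part = prev_sent or prev_cap
        context_image_caption_part = prev_cap or prev_sent
    else:
        # No previous message: reuse the user message's own text as context.
        context_sentence_part = sentence_part
        context_image_caption_part = image_caption_part

    return {
        "sentence": sentence_part,
        "image_caption": image_caption_part,
        "context_sentence": context_sentence_part,
        "context_image_caption": context_image_caption_part,
    }
```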
  • a caption summary summarizing the captions of the video frame or the plurality of images may be included in the input vector.
  • successive messages may be converted and processed into a single message of a predetermined context.
  • all four components of the input vector input to the DNN may be determined as sentence text converted into a single sentence in step 104 .
  • a plurality of input vectors for the text parts may be determined.
  • a plurality of output vectors may be generated.
  • a final output vector may be generated by concatenating a plurality of output vectors.
  • an output vector is generated as a result of inputting the input vector to the DNN.
  • the output vector may take the form of a hidden layer or a softmax layer expressed as a numeric vector. Also, the length of the output vector may be fixed.
  • Various DNN models can output semantic and sentiment vector representations of words/tags/sentences.
  • a pre-trained transfer learning architecture model similar to Embeddings from Language Models (ELMo) or Bidirectional Encoder Representations from Transformers (BERT) may be applied in step 108.
  • an additional deep learning model may be applied in step 108 .
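  • a minimal sketch of step 108, assuming a pre-trained sentence encoder from the sentence-transformers library stands in for the ELMo/BERT-style model named above; the model name and the concatenation scheme are illustrative assumptions, not the patented implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed to be installed

# Any pre-trained encoder that maps text to a fixed-length numeric vector would do;
# this particular model name is only an example.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def message_output_vector(parts: dict) -> np.ndarray:
    """Encode the four input-vector parts and concatenate them into one fixed-length vector."""
    texts = [parts["sentence"], parts["image_caption"],
             parts["context_sentence"], parts["context_image_caption"]]
    part_vectors = encoder.encode(texts)         # shape: (4, d)
    return np.concatenate(part_vectors, axis=0)  # fixed length 4 * d
```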
  • in steps 110 to 114, an output vector is derived from an audio signal input to the database.
  • steps 110 to 114 may be performed independently of steps 102 to 108, 116, and 118.
  • for example, steps 110 to 114 may be performed in advance in the server storing the database, before steps 102 to 108.
  • the audio signal is classified according to various criteria based on its frequency characteristics.
  • for example, an audio signal may be classified according to physical characteristics such as autoregression, adaptive time-frequency decomposition, and short-time Fourier transform.
  • also, through frequency analysis, audio signals can be classified according to cognitive characteristics such as the brightness, tone, loudness, pitch, chroma, and harmonicity of the sound.
  • Analysis information of the audio signal may be generated according to the physical and cognitive characteristics of the audio signal.
  • analysis information of the audio signal is output according to the analysis of the image or video.
  • the meaning, context, emotion, etc. of an image or video may be inferred using a DNN model trained on various contents and audio signals.
  • analysis information including the meaning, context, emotion, etc. of the image or video is output.
  • the analysis information may include various output values for an image or video.
  • the image analysis of step 102 may be applied to the image analysis of step 110 .
  • in step 112, the analysis information of the image output from step 110 is input, and NLP preprocessing is performed on the analysis information.
  • the NLP preprocessing may include text tokenization, text normalization, and/or text summarization.
  • the NLP preprocessing of step 106 may be applied to the NLP preprocessing of step 112 .
  • in step 114, the analysis information or the text of the image added to the audio signal is processed.
  • Parts of the input vector input to the DNN may be determined according to Table 1 presented above.
  • the method of determining the components of parts of the input vector of step 114 and the method of deriving the output vector from the input vector according to the DNN may be set the same as the method of step 108 .
  • the output vector generated for the audio signal is stored in a database.
  • in step 116, the output vector of the user message from step 108 and the output vectors of the audio signals stored in the database in step 114 are compared.
  • the comparison is performed on output vectors that have the form of fixed-length numeric vectors.
  • One output vector of a user message may match a plurality of output vectors in the database.
  • a similarity measure, e.g. the cosine similarity of numeric vectors, is calculated for each pair consisting of the output vector of the user message and an output vector of an audio signal in the database.
  • the similarity score according to the similarity measure is stored in memory for later selection in step 118 .
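  • one possible realization of the comparison in step 116 is sketched below. The cosine similarity follows its standard definition; the dictionary format used for the database vectors is an assumption.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two numeric vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def similarity_scores(msg_vec: np.ndarray, db_vectors: dict) -> dict:
    """Similarity score for every audio signal in the database.

    db_vectors is assumed to map an audio-signal id to the output vector
    stored for that signal in step 114.
    """
    return {audio_id: cosine_similarity(msg_vec, vec)
            for audio_id, vec in db_vectors.items()}
```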
  • final matching may be performed through machine learning classification.
  • in step 118, based on the similarity score between the output vector of the user message and the output vectors of the audio signals in the database, an audio signal having the output vector most similar to the output vector of the user message is selected. The selected audio signal may be transmitted to the user. Also, a simple mechanism for preventing the same audio signal from being repeatedly provided to one user may be applied. Also, a personalization option may be applied to select an audio signal preferred by the user from among a plurality of audio signals. The personalization option may also be applied in step 116 to reduce the amount of computation.
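  • the selection in step 118 could then consume those scores, for example as below; the simple repeat-avoidance and the preference boost are assumed illustrations of the mechanisms mentioned above.

```python
def select_audio_signal(scores: dict, recently_used=frozenset(),
                        preferred=frozenset(), preference_boost=0.05):
    """Pick the audio signal whose output vector is most similar to the user message.

    Skips signals already provided to this user (simple repeat-avoidance) and adds
    a small, assumed boost for signals matching the user's personalization option.
    """
    best_id, best_score = None, float("-inf")
    for audio_id, score in scores.items():
        if audio_id in recently_used:
            continue
        if audio_id in preferred:
            score += preference_boost
        if score > best_score:
            best_id, best_score = audio_id, score
    return best_id
```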
  • FIG. 2 illustrates an embodiment of a method of matching a user message including a user's voice message with an audio signal.
  • the matching method of FIG. 2 is based on a method of multi-label classification of an audio signal into a plurality of classes.
  • Each class is a subject, sub-topic, or sentiment of an audio signal.
  • the audio signal is not limited to one class and may be classified into various classes.
  • an audio signal including the sound of flowing water may be classified into classes such as rest, calm, outdoor, and water.
  • the multi-label classification may be performed according to a pre-trained DNN. For example, when an audio feature extracted from an audio signal is input to the DNN, various classes of the audio signal may be determined according to the resulting DNN output.
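  • a hedged sketch of such multi-label classification is shown below: the class names come from the flowing-water example above, while the model interface (a predict() method returning one sigmoid probability per class) and the 0.5 threshold are assumptions.

```python
import numpy as np

CLASS_NAMES = ["rest", "calm", "outdoor", "water"]  # example classes from the text above

def classify_audio(features: np.ndarray, dnn, threshold: float = 0.5):
    """Multi-label classification of an audio signal into a list of classes.

    `dnn` is assumed to be any trained model exposing predict() and returning
    one sigmoid probability per class for a batch of feature vectors.
    """
    probabilities = dnn.predict(features[np.newaxis, :])[0]
    return [name for name, p in zip(CLASS_NAMES, probabilities) if p >= threshold]
```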
  • voice messages included in user messages are also classified into class lists. According to the class, the voice message and the audio signal are matched.
  • the classification into various types of emotions may be implemented by applying a convolutional neural network (CNN) to available data such as a user's voice or face image.
  • in step 202, an audio feature of the voice message included in the user message is extracted.
  • in step 204, according to the audio feature of the extracted voice message, the voice message is classified into various classes.
  • the classification is based on the output of the DNN generated by inputting the audio features of the extracted voice message into the DNN.
  • the classes according to the output of the DNN are divided according to themes and emotions.
  • in step 206, an audio feature of the audio signal of the database is extracted.
  • in step 208, the audio signal is classified into various classes.
  • the classification is based on the output of the DNN generated by inputting the audio features of the extracted audio signal into the DNN.
  • the classes according to the output of the DNN are divided according to themes and emotions.
  • in step 210, the classes of the voice message from step 204 and the classes of the audio signal from step 208 are compared.
  • in step 212, according to the comparison result of the classes, a similarity score of the audio signal with respect to the voice message is determined. And according to the similarity score, an audio signal suitable for the voice message is determined.
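  • the comparison of steps 210 and 212 can be expressed, for instance, as an overlap score between the two class lists; the Jaccard measure used here is an assumed choice, not one specified by the disclosure.

```python
def class_similarity(message_classes, audio_classes) -> float:
    """Similarity score between a voice message's classes and an audio signal's classes
    (Jaccard overlap of the two class sets; an illustrative choice)."""
    a, b = set(message_classes), set(audio_classes)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Example: a calm outdoor voice message vs. the flowing-water signal classified above.
print(class_similarity(["calm", "outdoor"], ["rest", "calm", "outdoor", "water"]))  # 0.5
```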
  • FIG. 3 illustrates an embodiment of a method of matching a user message including a user's voice message with an audio signal.
  • in the embodiment of FIG. 3, the multi-label classification through the DNN of steps 204 and 208 of FIG. 2 is omitted. Accordingly, instead of comparing classes as in the embodiment of FIG. 2, the audio characteristics of the voice message of the user message and the audio characteristics of the audio signal are compared directly.
  • in step 302, an audio feature of the voice message included in the user message is extracted.
  • in step 304, an audio feature of the audio signal of the database is extracted.
  • in step 306, the audio feature of the voice message from step 302 and the audio feature of the audio signal from step 304 are compared.
  • in step 308, according to the comparison result of the audio features, a similarity score of the audio signal with respect to the voice message is determined. And according to the similarity score, an audio signal suitable for the voice message is determined.
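  • for the direct comparison of FIG. 3, one assumed implementation extracts a fixed-length spectral feature (here, mean MFCCs via librosa) for both signals and scores them by distance; the feature choice and the negative-distance score are illustrative, not taken from the disclosure.

```python
import numpy as np
import librosa  # assumed to be available for audio feature extraction

def audio_feature(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Extract a simple fixed-length audio feature: the mean MFCC vector."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    return mfcc.mean(axis=1)

def feature_similarity(voice_path: str, audio_path: str) -> float:
    """Similarity score as the negative Euclidean distance between the two features."""
    return -float(np.linalg.norm(audio_feature(voice_path) - audio_feature(audio_path)))
```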
  • FIG. 4 shows an audio signal 400 input to a database.
  • the database of FIGS. 1 to 3 stores the analysis result of the input audio signal.
  • annotations may be attached to the audio signal, and the annotations may include single text 402, dialogue text 404, an image 406, and a moving image 408.
  • the audio signal 400 is associated with the single text 402, the dialogue text 404, the image 406, and the moving image 408 included in the annotations.
  • the output vector determined from the content included in the annotation may be stored together with the audio signal.
  • the output vector determined from the content included in the annotation is compared with the output vector of the user message of step 108 .
  • an audio signal corresponding to an optimal output vector stored in the database according to the comparison result of step 116 is matched with the user message.
  • FIG. 5 illustrates an embodiment to which a method of matching a user message and an audio signal during a phone call is applied.
  • the audio signal may be added to a voice message or a text message. Alternatively, the audio signal may be added during a phone call. During the conversation, the caller may select an audio signal that matches the topic of the conversation and send the audio signal to the receiver.
  • the caller may send a voice message saying “Let’s go to the movies” to the receiver during a call.
  • in step 504, according to the voice message of the caller, selectable candidate audio signals are displayed on the mobile device, and according to the caller's input, one of the candidate audio signals displayed on the mobile device is sent to the receiver together with the caller's voice message. And in step 506, the receiver can hear the audio signal selected by the caller after the voice message "Let's go to the movies".
  • the receiver can select one of several options for the audio signal. For example, the receiver may select an option that causes the audio signal to be played immediately. In this case, the receiver hears the audio signal transmitted during the call immediately after the caller's voice message. Alternatively, the receiver may select an option for the audio signal to be reproduced only upon the receiver's selection. In this case, the audio signal is reproduced only if the receiver accepts its reproduction. In addition, the receiver can select an option that allows some audio signals to be played automatically while other audio signals are played only with the receiver's consent. For example, an audio signal expressing negative emotion or an audio signal that is too dynamic may be set to be played only with the receiver's acceptance. The options given to the receiver may be applied to an audio signal of a text message or an image message as well as to an audio signal of a voice message transmitted during a phone call.
  • FIG. 6 illustrates an embodiment of a method for selecting one of selectable candidate audio signals displayed on a mobile device.
  • the sender message includes a voice message 602 saying “let’s go to the swimming pool”.
  • the sender device 600 displays the candidate audio signals to be appended after the voice message 602.
  • the candidate audio signals are determined according to the following two options.
  • although the sender device 600 of FIG. 6 is depicted as a mobile device, the sender device 600 may be any other electronic device that includes a display.
  • the standard option 604 provides certain audio signals in a simple drop-down menu.
  • the predetermined audio signals may be provided independently of the voice message 602 .
  • the ASR matching option 606 provides audio signals adaptively determined according to the content of the voice message 602.
  • for example, an audio signal related to swimming, such as a waterfall sound, may be provided.
  • the audio signals according to the ASR matching option 606 may be provided together with associated images. The user can manually select one of the candidate audio signals provided according to the standard option 604 and the ASR matching option 606 of the drop-down menu.
  • FIG. 7 illustrates an embodiment of a method for processing an audio signal attached to an originator message in a receiver device.
  • in the first embodiment 700 of FIG. 7, the audio signal attached to the sender message is reproduced.
  • in the second embodiment 702 of FIG. 7, the audio signal of the sender message is used as an arrival alarm of the sender message.
  • in the third embodiment 704 of FIG. 7, an audio signal is reproduced while the text and image included in the sender message are displayed.
  • in the fourth embodiment 706 of FIG. 7, when the screen of the receiver device is locked, a part of the text included in the sender message is displayed on the screen, and the audio signal of the sender message is used as an arrival alarm of the sender message.
  • the audio signals of the first embodiment 700 to the fourth embodiment 706 may be selected by the caller.
  • the audio signal may be selected from candidate audio signals adaptively determined in response to an originator message.
  • the audio signal may be determined at the receiver device based on an analysis of the content and emotion of the sender message, regardless of the sender.
  • FIG. 8 illustrates an embodiment of an IoT home appliance in which an audio signal of a user message is reproduced.
  • the IoT home appliance device may play an audio signal along with a user message.
  • the first embodiment 800 represents an audio signal provided together with an alarm input to the calendar of the IoT refrigerator.
  • a user may input a specific event into the calendar application. And the user can set the alarm of a specific event to be notified by the IoT refrigerator.
  • an event to feed the dog at 9 a.m. on Sunday is input to the IoT refrigerator, and a dog sound corresponding to the text included in the event is determined as the audio signal.
  • the event indication may be shown on the display of the IoT refrigerator together with the dog sound at 9 a.m. on Sunday.
  • the second embodiment 802 represents an audio signal with which the IoT TV notifies the user of the operation state of a task of another IoT home appliance.
  • the user may instruct the robot vacuum cleaner to clean the floor of the house.
  • the robot cleaner may transmit a cleaning completion message to the IoT TV that the user is watching. Then, the audio signal for the cleaning completion message selected by the robot cleaner or the IoT TV is reproduced on the IoT TV.
  • the first embodiment 800 and the second embodiment 802 are only examples, and audio signals for various messages displayed or reproduced in the IoT home appliance device may be adaptively determined in response to the message.
  • the audio signal corresponding to the message in FIGS. 5 to 8 may be determined adaptively to the message according to the method of matching the message and the audio signal of FIGS. 1 to 3 .
  • FIG. 9 shows an embodiment of a data processing apparatus 900 that performs a method of matching a user message with an audio signal.
  • the data processing apparatus 900 may include a processor 902 and a memory 904 .
  • the processor 902 may control the data processing apparatus 900 as a whole.
  • the processor 902 may execute one or more programs stored in the memory 904 .
  • the memory 904 may store various data, programs, or applications for driving and controlling the data processing apparatus 900 .
  • a program stored in memory 904 may include one or more instructions.
  • a program (one or more instructions) or an application stored in the memory 904 may be executed by the processor 902 .
  • a user message can be obtained that includes text, image, video, voice, or a combination thereof.
  • a previous message received by the user before the user message may be obtained by the processor 902 . Since the user message is written in response to the previous message, the context of the user message may be analyzed by analyzing the previous message.
  • the image caption text of the user message may be determined from the image or video of the user message by the processor 902 .
  • a plurality of representative frames may be extracted from the video for each predetermined video frame unit.
  • the image caption text of the user message may be determined from the plurality of representative frames.
  • the voice of the user message may be converted into voice text according to automatic voice recognition.
  • when the length of the text of the user message including the voice text is greater than or equal to a predetermined length, the text of the user message may be summarized to be less than or equal to the predetermined length.
  • the processor 902 may determine an input vector to be input to the DNN from the user message. In addition to the user message, an input vector to be input to the DNN may be determined from a previous message.
  • the input vector is composed of at least one of a sentence part, an image caption part, a context sentence part, and a context image caption part.
  • the sentence part is composed of sentence text in which the text of the user message is summarized. If the user message does not include text but contains an image or video, the sentence part may be composed of image caption text.
  • the image caption part is composed of image caption text describing an image or video of the user message.
  • the image caption part may be composed of sentence text.
  • the context sentence part is composed of context sentence text in which the text of a previous message received by the user before the user message is summarized.
  • if there is no previous message, the context sentence part may be composed of the sentence text or image caption text of the user message.
  • if the previous message does not contain text but contains an image or video, the context sentence part may consist of context image caption text.
  • the context image caption part consists of context image caption text describing an image or video of a previous message.
  • if there is no previous message, the context image caption part may be composed of the sentence text or image caption text of the user message.
  • if the previous message does not include an image, the context image caption part may be composed of the context sentence text of the previous message.
  • an input vector is input to the DNN, whereby an output vector of a user message may be obtained from the DNN.
  • the output vector of the user message is compared with the output vector of the audio signal stored in the database, whereby a similarity score between the output vector of the user message and the output vectors of the plurality of audio signals in the database is determined.
  • the database includes an output vector generated by inputting an input vector according to analysis information of an audio signal to the DNN.
  • the analysis information may be determined based on a frequency characteristic of an audio signal and an analysis result of text, image, video, voice, or a combination thereof connected to the audio signal.
  • the processor 902 determines, based on the similarity score, an audio signal of the user message from among a plurality of audio signals of the database.
  • the processor 902 may determine the audio signal of the user message according to the user's preference among a plurality of candidate audio signals having a high similarity score.
  • the technical features related to the method of matching the user message and the audio signal described with reference to FIGS. 1 to 8 may be applied to the data processing apparatus 900 of FIG. 9 .
  • FIG. 10 illustrates an embodiment of a data processing method 1000 for matching a user message with an audio signal.
  • in step 1002, a user message including text, an image, a video, a voice, or a combination thereof is obtained. Also, a previous message received by the user before the user message may be obtained.
  • the image caption text of the user message may be determined from the image or video of the user message. Also, when a video is included in the user message, a plurality of representative frames may be extracted from the video for each predetermined video frame unit. In addition, the image caption text of the user message may be determined from the plurality of representative frames.
  • when the user message includes a voice, the voice of the user message may be converted into voice text according to automatic speech recognition.
  • when the length of the text of the user message including the voice text is greater than or equal to a predetermined length, the text of the user message may be summarized to be less than or equal to the predetermined length.
  • in step 1004, an input vector to be input to the DNN may be determined from the user message.
  • in addition to the user message, an input vector to be input to the DNN may be determined from the previous message.
  • in step 1006, the input vector is input to the DNN, whereby an output vector of the user message may be obtained from the DNN.
  • in step 1008, the output vector of the user message is compared with the output vectors of the audio signals stored in the database, whereby a similarity score between the output vector of the user message and the output vectors of the plurality of audio signals in the database is determined.
  • in step 1010, an audio signal for the user message is determined from among the plurality of audio signals of the database based on the similarity score.
  • the audio signal of the user message may be determined according to a user's preference among a plurality of candidate audio signals having a high similarity score.
  • each step of the data processing method 1000 of FIG. 10 can be written as a program that can be executed in a computer.
  • each step of the data processing method 1000 of FIG. 10 may be implemented in a general-purpose digital computer that operates the program using a computer-readable recording medium.
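  • putting the pieces together, the sketch below strings steps 1002 to 1010 into one flow, reusing the illustrative helpers sketched earlier in this description; it is an assumed composition, not the claimed implementation.

```python
def match_audio_to_message(user_msg: dict, prev_msg, db_vectors: dict,
                           recently_used=frozenset()):
    """Illustrative end-to-end flow of the data processing method 1000."""
    parts = build_input_parts(user_msg, prev_msg)      # steps 1002-1004: message -> input parts
    msg_vec = message_output_vector(parts)             # step 1006: DNN output vector
    scores = similarity_scores(msg_vec, db_vectors)    # step 1008: compare with database vectors
    return select_audio_signal(scores, recently_used)  # step 1010: pick the best audio signal
```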

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • General Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure provides a data processing method comprising the steps of: obtaining a user message including text, an image, video content, voice content, or a combination thereof; determining, from the user message, an input vector to be input to a DNN; obtaining an output vector of the user message from the DNN by inputting the input vector to the DNN; determining a similarity score between the output vector of the user message and output vectors of a plurality of audio signals stored in a database by comparing the output vector of the user message with an output vector of an audio signal stored in the database; and determining, based on the similarity score, an audio signal for the user message from among the plurality of audio signals stored in the database.
PCT/KR2021/001170 2020-02-07 2021-01-28 Method and apparatus for providing audio signals WO2021157957A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2020-0015138 2020-02-07
KR1020200015138A KR20210101374A (ko) 2020-02-07 2020-02-07 Method and apparatus for providing an audio signal

Publications (1)

Publication Number Publication Date
WO2021157957A1 true WO2021157957A1 (fr) 2021-08-12

Family

ID=77200201

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/001170 WO2021157957A1 (fr) 2020-02-07 2021-01-28 Method and apparatus for providing audio signals

Country Status (2)

Country Link
KR (1) KR20210101374A (fr)
WO (1) WO2021157957A1 (fr)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20170100175A (ko) * 2016-02-25 2017-09-04 삼성전자주식회사 전자 장치 및 전자 장치의 동작 방법
US20180144208A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Adaptive attention model for image captioning
KR20190094315A (ko) * 2019-05-17 2019-08-13 엘지전자 주식회사 스타일을 고려하여 텍스트와 음성을 상호 변환하는 인공 지능 장치 및 그 방법
KR20190100095A (ko) * 2019-08-08 2019-08-28 엘지전자 주식회사 음성 처리 방법 및 음성 처리 장치
KR20190127202A (ko) * 2018-05-03 2019-11-13 주식회사 케이티 스토리 컨텐츠에 대한 음향 효과를 제공하는 미디어 재생 장치 및 음성 인식 서버


Also Published As

Publication number Publication date
KR20210101374A (ko) 2021-08-19

Similar Documents

Publication Publication Date Title
WO2019225837A1 (fr) Procédé d'apprentissage de vocabulaire personnalisé inter-domaines et dispositif électronique associé
JP6233798B2 (ja) データを変換する装置及び方法
EP2891084A1 (fr) Dispositif d'affichage et procédé de recherche vocale
WO2020204655A1 (fr) Système et procédé pour un réseau de mémoire attentive enrichi par contexte avec codage global et local pour la détection d'une rupture de dialogue
WO2019088384A1 (fr) Procédé de fourniture de conversation en langage naturel à expression riche par modification de réponse, dispositif informatique et support d'enregistrement lisible par ordinateur
WO2014106986A1 (fr) Appareil électronique commandé par la voix d'un utilisateur et procédé pour le commander
WO2021132802A1 (fr) Appareil de recherche de vidéo utilisant des critères multimodaux et procédé associé
EP3031213A1 (fr) Appareil, serveur et procédé pour fournir un sujet de conversation
CN106776872A (zh) 根据语音定义语意进行语音搜索的方法及系统
WO2019147039A1 (fr) Procédé de détermination d'un motif optimal de conversation pour la réalisation d'un objectif à un instant particulier pendant une session de conversation associée à un système de service d'ia de compréhension de conversation, procédé de détermination de probabilité de prédiction d'accomplissement d'objectif et support d'enregistrement lisible par ordinateur
WO2018169276A1 (fr) Procédé pour le traitement d'informations de langue et dispositif électronique associé
CN114464180A (zh) 一种智能设备及智能语音交互方法
WO2019156536A1 (fr) Procédé et dispositif informatique pour construire ou mettre à jour un modèle de base de connaissances pour un système d'agent ia interactif en marquant des données identifiables mais non apprenables, parmi des données d'apprentissage, et support d'enregistrement lisible par ordinateur
WO2015102125A1 (fr) Système et procédé de conversation de texto
JP2013054417A (ja) コンテンツに対するタグ付けプログラム、サーバ及び端末
WO2021157957A1 (fr) Procédé et appareil de fourniture de signaux audio
WO2019168235A1 (fr) Procédé et système d'agent d'ia interactif permettant de fournir une détermination d'intention en fonction de l'analyse du même type de multiples informations d'entité, et support d'enregistrement lisible par ordinateur
WO2020235910A1 (fr) Système de reconstruction de texte et procédé associé
WO2021071271A1 (fr) Appareil électronique et procédé de commande associé
WO2021118050A1 (fr) Programme informatique d'édition automatique de vidéo de mises en évidence
WO2021066399A1 (fr) Système d'assistant vocal basé sur une intelligence artificielle réaliste utilisant un réglage de relation
WO2020091431A1 (fr) Système de génération de sous-titres utilisant un objet graphique
WO2023167496A1 (fr) Procédé de composition de musique utilisant l'intelligence artificielle
WO2012057561A2 (fr) Système et procédé pour fournir un service de messagerie instantanée, et terminal de communication et procédé de communication associés
WO2023146030A1 (fr) Dispositif, procédé et programme d'interaction basés sur l'intelligence artificielle et intégrant une émotion, un degré de concentration et une conversation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21751242

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21751242

Country of ref document: EP

Kind code of ref document: A1