WO2023128586A1 - Artificial intelligence-based dialogue situation prediction and intention classification system, and method thereof - Google Patents

Artificial intelligence-based dialogue situation prediction and intention classification system, and method thereof

Info

Publication number
WO2023128586A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
unit
voice
utterance
text
Prior art date
Application number
PCT/KR2022/021461
Other languages
French (fr)
Korean (ko)
Inventor
정호영
김준우
윤혜경
윤은지
Original Assignee
경북대학교 산학협력단
Priority date
Filing date
Publication date
Priority claimed from KR1020220041966A external-priority patent/KR20230100543A/en
Application filed by 경북대학교 산학협력단
Publication of WO2023128586A1 publication Critical patent/WO2023128586A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems

Definitions

  • the present invention relates to a conversation situation prediction and intention classification system and method based on artificial intelligence.
  • an artificial intelligence speaker can recognize a voice provided by a user and generate and output a response based on a built-in algorithm. Users can conveniently access various information using artificial intelligence speakers.
  • the artificial intelligence speaker cannot accurately recognize the user's voice and may provide inaccurate information to the user.
  • the present invention was derived from research conducted as part of the Information, Communication and Broadcasting Innovation Talent Fostering (R&D) program of the Ministry of Science and ICT (task identification number: 1711125907, task number: 2020-0-01808-002, research project title: research on complex-information-based predictive intelligence innovative technology, project management agency: Institute of Information & Communications Technology Planning & Evaluation, project executing agency: Kyungpook National University Industry-University Cooperation Foundation, research period: 2021.01.01.-2021.12.31.). Meanwhile, the Korean government holds no property interest in any aspect of the present invention.
  • a technical problem to be solved by the present invention is to provide a conversation situation prediction and intention classification system and method based on artificial intelligence that recognizes a user's voice according to the user's gender and age.
  • a technical problem to be solved by the present invention is to provide a conversation situation prediction and intention classification system and method based on artificial intelligence for determining a user's speech intention based on a user's voice.
  • a technical problem to be solved by the present invention is to provide a dialogue situation prediction and intention classification system and method based on artificial intelligence that predicts the user's speech after the user's speech based on the user's speech intention.
  • a technical problem to be solved by the present invention is to provide a dialogue situation prediction and intention classification system and method based on artificial intelligence that generates a response based on the user's speech intention and the predicted user's speech.
  • the at least one processor recognizes the user's request utterance uttered from one point in time to another point in time, and includes:
  • a voice determination unit that determines the gender and age of the user based on the request utterance;
  • a voice processing unit that converts the user's request utterance into text to determine the user's utterance intention, and determines the user's predicted utterance after the other point in time; and
  • a response generation unit that generates a response to the user's request utterance based on the user's utterance intention and the predicted utterance.
  • the voice determination unit includes: a voice extraction unit that extracts first voice data based on the request utterance at a certain point in time;
  • a first deep learning unit that inputs the first voice data as an input value of a pre-stored first decision algorithm, calculates the user's gender and age as first probability values, and determines the user's gender and age based on the first probability values; and
  • a voice recognizer distribution unit that recognizes the request utterance and extracts second voice data.
  • the first deep learning unit determines the gender and age corresponding to the highest probability value among the first probability values as the gender and age of the user.
  • the voice recognizer distribution unit includes a plurality of different voice recognizers for recognizing a user's requested utterance and generating second voice data corresponding to the determined gender and age.
  • the voice processing unit includes: a voice-to-text conversion unit that converts the second voice data into first converted text;
  • a second deep learning unit that inputs the first converted text as an input value of a pre-stored first deep learning algorithm to determine the first request text corresponding to the user's request utterance;
  • a speech intention classification unit that inputs the first request text as an input value of a pre-stored intention classification algorithm to generate speech intention data for the utterance intention; and
  • a speech intention prediction unit that inputs the speech intention data as an input value of a pre-stored intention prediction algorithm to generate predicted speech data for the predicted utterance.
  • the second deep learning unit includes a converted text input unit that receives the first converted text, a first voice model deep learning unit that inputs the first converted text as an input value of the pre-stored first deep learning algorithm to generate the first request text, and a request text output unit that outputs the first request text.
  • the second deep learning unit further includes a reference text storage unit that stores a first reference text corresponding to the user's request utterance, and a first error rate calculation unit that calculates a first error rate value between the first converted text and the first request text.
  • the second deep learning unit further includes a second voice model deep learning unit that inputs the first reference text as an input value of a pre-stored second deep learning algorithm to perform deep learning, a second error rate calculation unit that calculates a second error rate value between the first reference text and a second reference text that is the output value of the second voice model deep learning unit, and a weight value calculation unit that calculates a weight value for deep learning of the first voice model deep learning unit based on the first error rate value and the second error rate value.
  • the first voice model deep learning unit sets the weight value as the weight value of the pre-stored first deep learning algorithm and inputs the first converted text as an input value to perform deep learning.
  • the speech intention classification unit inputs the first request text as an input value of the intention classification algorithm stored in advance to generate speech intention data obtained by calculating the speech intention as a second probability value.
  • the speech intention classification unit determines the speech intention having the highest probability value among the second probability values as the user's speech intention.
  • the speech intention prediction unit inputs speech intention data as an input value of an intention prediction algorithm stored in advance to generate predicted speech data obtained by calculating predicted speech as a third probability value.
  • the speech intention prediction unit determines the predicted utterance having the highest probability value among the third probability values as the user's predicted utterance.
  • the response generation unit includes a response text generation unit that inputs the speech intention data and the predicted speech data as input values of a pre-stored response algorithm to generate a response text in response to the user's request utterance, and a text-to-speech conversion unit that converts the response text into voice data.
  • the at least one processor includes a voice determination unit, a voice processing unit, and a response generation unit, and the method includes: determining, by the voice determination unit, the user's gender and age based on the user's request utterance from one point in time to another point in time; converting, by the voice processing unit, the user's request utterance into text to determine the user's utterance intention, and determining the user's predicted utterance after the other point in time; and generating, by the response generation unit, a response to the user's request utterance based on the user's utterance intention and the predicted utterance.
  • it includes a computer-readable non-transitory recording medium on which a program for executing a voice processing system based on artificial intelligence according to an embodiment of the present invention is recorded.
  • the dialogue situation prediction and intention classification system based on artificial intelligence according to the present invention can accurately recognize a user's voice according to the user's gender and age.
  • the dialogue situation prediction and intention classification system based on artificial intelligence can determine the user's speech intention based on the user's voice.
  • the conversation situation prediction and intention classification system based on artificial intelligence can predict the user's utterance after the user's utterance based on the user's utterance intention.
  • the dialogue situation prediction and intention classification system based on artificial intelligence may generate a response based on the user's speech intention and the predicted user's speech.
  • FIG. 1 is a diagram illustrating a dialogue situation prediction and intention classification system based on artificial intelligence according to an embodiment of the present invention.
  • FIG. 2 is a diagram explaining a process of determining the gender and age of a user and extracting second voice data according to an embodiment of the present invention.
  • FIG. 3 is a diagram explaining a user's request utterance and first converted text according to an embodiment of the present invention.
  • FIG. 4 is a diagram illustrating a second deep learning unit according to an embodiment of the present invention.
  • FIG. 5 is a diagram explaining a process of determining a user's utterance intention and predicted utterance and generating a response according to an embodiment of the present invention.
  • FIG. 6 is a diagram illustrating a method for predicting conversation situations and classifying intentions based on artificial intelligence according to an embodiment of the present invention.
  • FIG. 7 is a diagram explaining a method of adjusting a first deep learning algorithm according to an embodiment of the present invention.
  • the expression "the same” in the description may mean “substantially the same”. That is, it may be the same to the extent that a person with ordinary knowledge can understand that it is the same.
  • Other expressions should likewise be understood as expressions from which "substantially" has been omitted.
  • The term '~unit' used in this specification refers to a unit that processes at least one function or operation, and may mean, for example, software, an FPGA, or a hardware component. Functions provided by a '~unit' may be performed separately by a plurality of components or may be integrated with other additional components. A '~unit' in this specification is not necessarily limited to software or hardware; it may be configured to reside in an addressable storage medium or to run on one or more processors.
  • FIG. 1 is a diagram illustrating a dialogue situation prediction and intention classification system based on artificial intelligence according to an embodiment of the present invention.
  • a dialogue situation prediction and intention classification system 1 based on artificial intelligence includes at least one processor, and the at least one processor may implement or include a voice determination unit 10, a voice processing unit 20, and a response generation unit 30.
  • the voice determination unit 10 may include a voice extraction unit 100 , a first deep learning unit 110 and a voice recognizer distribution unit 120 .
  • the voice processing unit 20 may include a voice-to-text conversion unit 200, a second deep learning unit 210, a speech intention classification unit 220, and a speech intention prediction unit 230.
  • the response generator 30 may include a response text generator 300 and a text-to-speech converter 310 .
  • the artificial intelligence speaker 2 may recognize voice based on the user 3's speech.
  • the artificial intelligence speaker 2 may output a response corresponding to the user 3's speech. Through this, the user 3 can obtain various information from the artificial intelligence speaker 2.
  • hereinafter, the speech of the user 3 input to the artificial intelligence speaker 2 will be referred to as a 'request utterance'.
  • the voice determination unit 10 may recognize the requested utterance of the user 3 from one point in time to another point in time.
  • the voice extracting unit 100 may extract voice data at any point in time based on the requested utterance of the user 3 .
  • the voice data of the user 3 extracted by the voice extractor 100 at any point in time will be referred to as first voice data.
  • the voice extraction unit 100 may extract first voice data including 'Air,' which is a requested utterance of the user 3 at any point in time.
  • the first deep learning unit 110 may calculate the gender and age of the user 3 as probability values by using the first voice data as an input value of the first decision algorithm stored in advance.
  • the first deep learning unit 110 may determine the user's gender and age based on the probability value.
  • the first deep learning unit 110 inputs the first voice data as an input value of the pre-stored first decision algorithm and calculates, as probability values, the probabilities that the gender of the user 3 is male or female and that the age is adult, elderly, or child.
  • the probability values calculated by the first deep learning unit 110 by inputting the first voice data as an input value of the pre-stored first decision algorithm will be referred to as first probability values.
  • the first deep learning unit 110 may determine the gender and age corresponding to the highest probability value among the calculated first probability values as the gender and age of the user 3.
  • the process by which the first deep learning unit 110 determines the gender and age of the user 3 by inputting the first voice data into the pre-stored decision algorithm will be described in detail with reference to FIG. 2 below.
  • the voice recognizer distribution unit 120 may include a plurality of different voice recognizers that recognize the requested utterance of the user 3 from one point in time to another point in time according to the gender and age of the user 3 and generate voice data based on the requested utterance.
  • the voice recognizer distribution unit 120 determines one voice recognizer corresponding to the user's gender and age, and may recognize the user's requested utterance from one point in time to another point in time using that voice recognizer.
  • Any one of the voice recognizers may recognize a user's requested utterance and extract voice data based on it.
  • voice data recognized and extracted by any one voice recognizer included in the voice recognizer distribution unit 120 will be referred to as second voice data.
  • if the user is determined to be an adult male, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing an adult male's requested utterance, and that recognizer may recognize the adult male's requested utterance and generate second voice data.
  • if the user is determined to be an adult female, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing an adult female's requested utterance, and that recognizer may recognize the adult female's requested utterance and generate second voice data.
  • if the user is determined to be an elderly male, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing an elderly male's requested utterance, and that recognizer may recognize the elderly male's requested utterance and generate second voice data.
  • if the user is determined to be an elderly female, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing an elderly female's requested utterance, and that recognizer may recognize the elderly female's requested utterance and generate second voice data.
  • if the user is determined to be a male child, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing a male child's requested utterance, and that recognizer may recognize the male child's requested utterance and generate second voice data.
  • if the user is determined to be a female child, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing a female child's requested utterance, and that recognizer may recognize the female child's requested utterance and generate second voice data.
  • in this way, the voice recognizer distribution unit 120 may determine one voice recognizer for recognizing the requested utterance of the user 3 corresponding to one gender (male or female) and one age group (adult, elderly, or child).
  • the selected voice recognizer may recognize the requested utterance of the user 3 from one point in time to another point in time, and may extract the second voice data based on that requested utterance; a minimal selection sketch follows.
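  • As an illustration only (the patent discloses no code), the recognizer-per-demographic selection could look like the following Python sketch; all class, variable, and function names are assumptions.

```python
# Illustrative sketch of the voice recognizer distribution unit (120):
# one recognizer per (gender, age-group) pair, as in the six cases above.

class SpeechRecognizer:
    """Stand-in for a recognizer trained on one demographic group."""
    def __init__(self, group: str) -> None:
        self.group = group

    def recognize(self, request_utterance_audio: bytes) -> bytes:
        # In the patent this step would produce the "second voice data";
        # a real implementation would run a group-specific acoustic model.
        raise NotImplementedError

# One recognizer per (gender, age) combination.
RECOGNIZERS = {
    (gender, age): SpeechRecognizer(f"{gender}-{age}")
    for gender in ("male", "female")
    for age in ("adult", "elderly", "child")
}

def select_recognizer(gender: str, age: str) -> SpeechRecognizer:
    """Return the single recognizer matching the determined gender and age."""
    return RECOGNIZERS[(gender, age)]
```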
  • the voice processing unit 20 may determine the user's speech intention by converting the user's requested speech from one point in time to another point in time into text.
  • the voice processing unit 20 may determine the user's predicted utterance after another point in time.
  • the voice-to-text conversion unit 200 may convert the second voice data extracted from any one voice recognizer included in the voice recognizer distribution unit 120 into text. At this time, the text may contain errors.
  • text converted by the speech-to-text conversion unit 200 and including errors will be referred to as first converted text.
  • one of the voice recognizers may misrecognize the requested utterance of the user 3 due to various external or internal factors, such as the inclusion of external noise in the process of recognizing the requested utterance of the user 3.
  • for example, based on an adult man's requested utterance 'Air, tell me today's weather', the voice recognizer that recognizes the adult man's requested utterance may misrecognize it as 'Air, tell me the schedule for May'.
  • one of the voice recognizers may extract second voice data composed of 'Air tell me the schedule for May' based on this.
  • the voice-to-text conversion unit 200 may convert the second voice data into first converted text, and the first converted text may be composed of 'Air, let me know the schedule for May' including an error.
  • the second deep learning unit 210 may generate the first request text by inputting the first converted text as an input value of the first deep learning algorithm stored in advance.
  • the first deep learning algorithm when the first converted text is input as an input value of the first deep learning algorithm stored in advance in the second deep learning unit 210, the first deep learning algorithm outputs the request text corresponding to the user's speech request as an output value.
  • the request text generated by the first deep learning algorithm previously stored in the second deep learning unit 210 will be referred to as a first request text.
  • the first deep learning algorithm of the second deep learning unit 210 is composed of an input layer, a hidden layer, and an output layer, and may mean the process of inputting the first converted text as an input value of the input layer and outputting the first request text from the output layer.
  • the first converted text input to the input layer may be converted into the first request text as an output value by adding predetermined weight values in the hidden layer.
  • the predetermined weight values added in the hidden layer of the second deep learning unit 210 may be reset by a weight value computed from the first error rate and the second error rate described with reference to FIG. 4 below, and through this process the first deep learning algorithm can be tuned; a toy forward pass is sketched below.
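  • A toy forward pass matching the input/hidden/output description above; the shapes, the tanh activation, and all names are assumptions for illustration, not the patent's algorithm.

```python
import numpy as np

# Toy forward pass through input -> hidden -> output. W_hidden stands for the
# "predetermined weight values" added in the hidden layer that fine-tuning resets.
def forward(x: np.ndarray, W_hidden: np.ndarray, W_out: np.ndarray) -> np.ndarray:
    hidden = np.tanh(x @ W_hidden)   # hidden layer applies the adjustable weights
    return hidden @ W_out            # output layer yields first-request-text scores

# Example with arbitrary dimensions: 8 input features, 16 hidden units, 4 outputs.
rng = np.random.default_rng(0)
scores = forward(rng.normal(size=(1, 8)),
                 rng.normal(size=(8, 16)),
                 rng.normal(size=(16, 4)))
```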
  • the second deep learning unit 210 may calculate an error rate (or loss value) between the first converted text and the first request text output from the first deep learning algorithm.
  • the error rate (or loss value) calculated between the first converted text in the second deep learning unit 210 and the first request text output from the first deep learning algorithm is the first error rate (or the first loss value). ) to be named.
  • the first request text output from the first deep learning algorithm may not correspond exactly to the user's request utterance.
  • in this case, the second deep learning unit 210 may calculate a first error rate (or first loss value) between the first request text output from the first deep learning algorithm and the first converted text input as the input value of the first deep learning algorithm.
  • the second deep learning unit 210 inputs the first reference text (or first transcription text) as an input value of the pre-stored second deep learning algorithm, and the second deep learning algorithm may be fine-tuned so that its output value is a second reference text (or second transcription text) identical to the first reference text (or first transcription text).
  • the first reference text (or transcription text) is a text that exactly corresponds to the user's utterance and, in this example, is 'Air, tell me today's weather' without errors.
  • the second deep learning unit 210 may calculate a second error rate (or second loss value) between the first reference text (or first transcription text) and the second reference text (or second transcription text) output as the output value of the pre-stored second deep learning algorithm.
  • the second deep learning unit 210 may fine-tune the second deep learning algorithm based on the second error rate (or second loss value) and perform deep learning.
  • the second deep learning unit 210 may calculate a weight value using a first error rate (or a first loss value) and a second error rate (or a second loss value). The second deep learning unit 210 may fine-tune the previously stored first deep learning algorithm using the generated weight values.
  • through this, the first deep learning algorithm of the second deep learning unit 210 can accurately recognize the user's requested utterance and generate the corresponding text (that is, the first request text).
  • the speech intention classification unit 220 may generate speech intention data for the user's speech intention by inputting the first request text as an input value of the intention classification algorithm stored in advance.
  • the speech intention classification unit 220 may calculate the user's speech intention as a probability value by inputting the first request text 'Air, tell me the weather today' as an input value of a pre-stored intention classification algorithm.
  • the probability value calculated by the speech intention classification unit 220 will be referred to as a second probability value.
  • the speech intention classification unit 220 may determine the speech intention corresponding to the highest probability value among the calculated second probability values as the user's speech intention. For example, the speech intention classification unit 220 may determine that the user's speech intention is 'weather' of 'today'.
  • the speech intention prediction unit 230 may generate predicted speech data for the user's predicted speech by inputting speech intention data as an input value of an intention prediction algorithm stored in advance.
  • the speech intention prediction unit 230 may calculate the user's predicted speech as a probability value by inputting 'today' and 'weather', which are speech intention data, as input values of an intention prediction algorithm stored in advance.
  • the probability value calculated by the speech intention prediction unit 230 will be referred to as a third probability value.
  • the utterance intention prediction unit 230 may determine the predicted utterance corresponding to the highest probability value among the calculated third probability values as the user's predicted utterance.
  • the speech intention prediction unit 230 may determine 'clothes' as the user's predicted utterance at a first point in time, which is after another point in time when the user 3's requested utterance ends. In addition, the speech intention prediction unit 230 may determine the user's predicted speech at a second time point after the first time point as a 'place'.
  • the speech intention classification unit 220 may determine the user's speech intention from one point in time to another point in time based on the first request text and generate speech intention data.
  • the speech intention prediction unit 230 may determine the user's predicted speech after another point in time when the user 3's speech request ends, and generate predicted speech data.
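  • A minimal sketch of the two arg-max steps above, assuming the stored algorithms can be treated as callables returning probability tables; these interfaces are illustrative assumptions, not disclosed in the patent.

```python
from typing import Callable, Dict

def classify_intention(request_text: str,
                       intent_classifier: Callable[[str], Dict[str, float]]) -> str:
    """Pick the utterance intention with the highest second probability value."""
    second_probs = intent_classifier(request_text)   # e.g. {"weather": 0.9, "clothes": 0.1}
    return max(second_probs, key=second_probs.get)

def predict_utterance(intention: str,
                      intent_predictor: Callable[[str], Dict[str, float]]) -> str:
    """Pick the predicted utterance with the highest third probability value."""
    third_probs = intent_predictor(intention)        # e.g. {"clothes": 0.6, "place": 0.4}
    return max(third_probs, key=third_probs.get)
```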
  • the response generation unit 30 generates a response to the user's requested utterance based on the user's speech intention determined by the speech intention classification unit 220 and the user's predicted speech determined by the speech intention prediction unit 230. can do.
  • the response text generator 300 may generate a response to the user's requested utterance by inputting utterance intention data and predicted utterance data as input values of a response algorithm stored in advance.
  • the response text generation unit 300 may input the speech intention data including 'today' and 'weather' and the predicted speech data including 'clothes' and 'place' as input values of the pre-stored response algorithm.
  • the response text generation unit 300 may then generate a response text such as 'Today's weather is hot. It is recommended to wear thin and long clothes when going outside to prevent air-conditioning sickness.'
  • the text-to-speech conversion unit 310 may convert the response text generated by the response text generation unit 300 into voice data.
  • the text-to-speech conversion unit 310 may transmit voice data to the artificial intelligence speaker 2 and output the result as a response to the user 3's requested utterance.
  • FIG. 2 is a diagram explaining a process of determining the gender and age of a user and extracting second voice data according to an embodiment of the present invention.
  • the voice extraction unit 100 may extract first voice data ('Air') at any one point in time based on the requested utterance of the user 3.
  • the first deep learning unit 110 may calculate the gender and age of the user 3 as probability values by using the first voice data ('Air') as an input value of the pre-stored first decision algorithm.
  • assume that the voice extraction unit 100 extracts the first voice data ('Air') based on the user 3's requested utterance, and that the first deep learning unit 110 inputs the first voice data ('Air') as an input value of the pre-stored first decision algorithm.
  • the first probability value that the user 3 is male and an adult may then be calculated as 0.9 as an output value; that is, the first deep learning unit 110 may determine that the first probability value for male gender and adult age is 0.9.
  • a first probability value that the user 3 has a female gender and an adult age can be calculated as 0.02 as an output value.
  • the first deep learning unit 110 may calculate a first probability value that the user 3 has a female gender and an adult age as 0.02.
  • a first probability value that the user 3 is a male and an elderly person can be calculated as 0.03 as an output value. That is, the first deep learning unit 110 may calculate a first probability value of 0.03 when the gender of the user 3 is male and the age is an elderly person.
  • a first probability value that the user 3 is a woman and an elderly person can be calculated as 0.02 as an output value. That is, the first deep learning unit 110 may calculate a first probability value that the user 3 has a female gender and an elderly age as 0.02.
  • a first probability value that the user 3 is male and the age is a child can be calculated as 0.02 as an output value.
  • the first deep learning unit 110 may calculate a first probability value that the user 3 has a male gender and a child age as 0.02.
  • a first probability value that the user 3 has a female gender and a child age can be calculated as 0.01 as an output value.
  • the first deep learning unit 110 may calculate a first probability value that the user 3 has a female gender and a child age as 0.01.
  • the first deep learning unit 110 may determine the gender (male) and age (adult) corresponding to the highest probability value (0.9) among the first probability values as the gender and age of the user 3 .
  • the voice recognizer distribution unit 120 may determine any one voice recognizer for recognizing an adult male's requested utterance. Any one voice recognizer selected by the voice recognizer distribution unit 120 may recognize a user's requested utterance from a certain point in time to another point in time, and extract second voice data based thereon.
  • the selected voice recognizer recognizes the requested utterance of the adult male user 3 and extracts the second voice data based on it; the arg-max step is restated as code below.
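  • The FIG. 2 example restated as data, with the arg-max picking (male, adult) at 0.9; only the probability values come from the document, the dict layout is an assumption.

```python
# First probability values over the six (gender, age) classes from FIG. 2.
first_probs = {
    ("male", "adult"): 0.90, ("female", "adult"): 0.02,
    ("male", "elderly"): 0.03, ("female", "elderly"): 0.02,
    ("male", "child"): 0.02, ("female", "child"): 0.01,
}
gender, age = max(first_probs, key=first_probs.get)
assert (gender, age) == ("male", "adult")  # highest value: 0.9
```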
  • FIG. 3 is a diagram explaining a user's request utterance and first converted text according to an embodiment of the present invention.
  • the voice recognizer distribution unit 120 may determine one of the voice recognizers for recognizing the requested utterance of the user 3, and that voice recognizer may recognize the requested utterance of the user 3 and generate the first converted text based thereon.
  • any one voice recognizer may misrecognize the requested utterance of the user 3 due to various external or internal factors, such as the inclusion of external noise in the process of recognizing the requested utterance of the user 3 .
  • one voice recognizer may extract second voice data including 'Air, make a reservation for May'. Any one of the voice recognizers may convert the second voice data into first converted text consisting of 'Air, make a reservation for May'.
  • any one voice recognizer may extract second voice data including 'Air ## Tell it to Genie'. Any one of the voice recognizers may convert the second voice data into first converted text composed of 'Air ## Tell it to Genie'.
  • when the user 3 is a male child, one of the voice recognizers may misrecognize the user 3's requested utterance 'Air, play Pororo at 6 o'clock'.
  • for example, the voice recognizer may extract second voice data including 'play the air lottery game at 6 o'clock' and convert it into first converted text consisting of 'play the air lottery game at 6 o'clock'.
  • as another example, one of the voice recognizers may extract second voice data including 'Air ## do not play YouTube to study' and convert it into first converted text consisting of 'Don't play ##StudyingYouTube.'
  • in this way, any one voice recognizer may misrecognize the requested utterance of the user 3 due to various external or internal factors, such as external noise included in the process of recognizing the requested utterance, and may extract second voice data and generate first converted text accordingly.
  • FIG. 4 is a diagram illustrating a second deep learning unit according to an embodiment of the present invention.
  • the second deep learning unit 210 includes a converted text input unit 211, a first speech model deep learning unit 212, a request text output unit 213, a reference text storage unit 214, and a second speech model deep learning unit. 215, a weight value calculation unit 216, a first error rate calculation unit 2120, and a second error rate calculation unit 2150 may be included.
  • the converted text input unit 211 may receive the first converted text converted by any one voice recognizer.
  • the converted text input unit 211 may receive a first converted text consisting of 'Air, make a reservation for May'.
  • the first converted text may be input as an input value of a first deep learning algorithm stored in advance.
  • the first deep learning algorithm previously stored in the first voice model deep learning unit 212 may generate the first request text as an output value.
  • the first voice model deep learning unit 212 may adjust the first deep learning algorithm using the weight values provided by the weight value calculation unit 216 to be described below.
  • the first voice model deep learning unit 212 may perform deep learning by inputting the first converted text as an input value of the adjusted first deep learning algorithm and outputting the first request text.
  • the first error rate calculation unit 2120 may calculate a first error rate (or first loss value) between the first converted text input to the first voice model deep learning unit 212 and the first request text output from the first deep learning algorithm, and may provide the first error rate (or first loss value) to the weight value calculation unit 216.
  • the request text output unit 213 may transmit the first request text ('Air, tell me today's weather') generated by the fine-tuned first deep learning algorithm pre-stored in the first voice model deep learning unit 212 to the response generation unit 30 (see FIG. 1).
  • the reference text storage unit 214 may store the first reference text (or first transcription text) in advance.
  • the reference text storage unit 214 stores in advance a first reference text (or first transcription text) consisting of 'Air, tell me today's weather', which exactly corresponds to the user's requested utterance and does not include an error.
  • the second voice model deep learning unit 215 may receive the first reference text (or first transcription text) from the reference text storage unit 214 .
  • the second voice model deep learning unit 215 inputs the first reference text (or first transcription text) as an input value of the pre-stored second deep learning algorithm, and the second deep learning algorithm may be fine-tuned to output a second reference text (or second transcription text) identical to the first reference text (or first transcription text), thereby performing deep learning.
  • in other words, the second voice model deep learning unit 215 uses the first reference text (or first transcription text) provided from the reference text storage unit 214 as the input value of the pre-stored second deep learning algorithm, and performs deep learning so that the output value of the pre-stored second deep learning algorithm is a second reference text (or second transcription text) identical to the first reference text (or first transcription text).
  • the second error rate calculation unit 2150 may calculate a second error rate between the first reference text (or first transcription text) and the second reference text (or second transcription text) output from the pre-stored second deep learning algorithm after the first reference text (or first transcription text) is input.
  • the second error rate calculation unit 2150 may provide the second error rate to the weight value calculation unit 216 .
  • the weight value calculation unit 216 may calculate weight values for fine-tuning the first deep learning algorithm of the first voice model deep learning unit 212 based on the first error rate value and the second error rate value.
  • the weight value may be expressed as [Equation 1] below.
  • Weight value = a * (first error rate value) + b * (second error rate value)
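  • [Equation 1] written out as a function; the coefficients a and b are left unspecified by the patent, so any concrete values are assumptions.

```python
def weight_value(first_error_rate: float, second_error_rate: float,
                 a: float, b: float) -> float:
    """[Equation 1]: weighted sum of the two error rate values."""
    return a * first_error_rate + b * second_error_rate

# e.g. weight_value(0.25, 0.10, a=0.5, b=0.5) -> 0.175 (illustrative numbers)
```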
  • the weight value calculation unit 216 may provide the weight values to the first voice model deep learning unit 212 .
  • the weight value provided from the weight value calculation unit 216 is set as the weight value of the first voice model deep learning unit 212 to adjust the first deep learning algorithm, and the first voice model deep learning unit 212 may perform deep learning by inputting the first converted text as an input value of the adjusted first deep learning algorithm.
  • for example, the weight value provided from the weight value calculation unit 216 may be set as the weight value added in the hidden layer.
  • when the first voice model deep learning unit 212 inputs the first converted text containing errors as an input value of the pre-stored first deep learning algorithm, the errors are removed and the first request text exactly corresponding to the user's request utterance can be determined.
  • that is, the first voice model deep learning unit 212 may fine-tune the first deep learning algorithm by using as its weight value the weight value computed from the first error rate value (or first loss value), calculated between the error-containing first converted text and the first request text, and the second error rate value, calculated with the first reference text (or first transcription text) as the input value and the second reference text (or second transcription text) as the output value.
  • the first voice model deep learning unit 212 may then input the first converted text as an input value of the adjusted first deep learning algorithm and generate a first request text exactly corresponding to the user's request utterance; one fine-tuning step is sketched below.
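  • A hedged sketch of one step of this FIG. 4 feedback loop, assuming a toy character-level error-rate proxy and model objects with an update_weight method; none of these interfaces appears in the patent, which specifies only the two error rates and [Equation 1].

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Toy proxy for an error rate: fraction of mismatched character positions."""
    n = max(len(reference), len(hypothesis), 1)
    mismatches = sum(r != h for r, h in zip(reference, hypothesis))
    return (mismatches + abs(len(reference) - len(hypothesis))) / n

def fine_tune_step(first_model, second_model, converted_text: str,
                   reference_text: str, a: float = 0.5, b: float = 0.5) -> float:
    request_text = first_model(converted_text)                # first request text
    err1 = char_error_rate(converted_text, request_text)      # first error rate value
    second_reference = second_model(reference_text)           # second reference text
    err2 = char_error_rate(reference_text, second_reference)  # second error rate value
    weight = a * err1 + b * err2                              # [Equation 1]
    first_model.update_weight(weight)                         # adjust hidden-layer weights
    return weight
```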
  • FIG. 5 is a diagram explaining a process of determining a user's utterance intention and predicted utterance and generating a response according to an embodiment of the present invention.
  • the speech intention classification unit 220 may receive a first request text corresponding to the user's requested speech.
  • the speech intention classification unit 220 may calculate the user's speech intention as a second probability value by inputting the first request text as an input value of the intention classification algorithm stored in advance.
  • the speech intention classification unit 220 may determine the speech intention based on the user's speech request based on the second probability value.
  • the speech intention classification unit 220 inputs the first request text ('Let me know the weather today') as an input value of a pre-stored intention classification algorithm, and converts the user's speech intention into a second probability value. can be calculated
  • the speech intention classification unit 220 may calculate the second probability value that the user's speech intention is 'weather' as 0.9.
  • the speech intention classification unit 220 may calculate a second probability value that the user's speech intention is 'clothes' as 0.1.
  • the speech intention classification unit 220 may determine the speech intention ('weather') corresponding to the highest probability value (0.9) among the second probability values as the speech intention based on the user's speech request.
  • the speech intention classification unit 220 may generate speech intention data including 'weather' based on the user's speech intention.
  • the speech intention prediction unit 230 may receive speech intention data corresponding to the user's speech intention.
  • the speech intention prediction unit 230 may calculate the user's predicted speech as a third probability value by inputting the speech intention data as an input value of the intention prediction algorithm stored in advance.
  • the speech intention prediction unit 230 may determine the user's predicted speech based on the third probability value.
  • the speech intention prediction unit 230 may input the speech intention data ('weather') as an input value of the pre-stored intention prediction algorithm to calculate, as third probability values, the user's predicted utterance at a first time point after the other point in time.
  • the speech intention prediction unit 230 may calculate the third probability value that the user's predicted utterance at the first point in time is 'clothes' as 0.6.
  • the speech intention prediction unit 230 may calculate the third probability value that the user's predicted utterance at the first time point is 'place' as 0.4.
  • the utterance intention prediction unit 230 may determine the predicted utterance ('clothing') corresponding to the highest probability value (0.6) among the third probability values as the user's predicted utterance at the first time point.
  • the speech intention prediction unit 230 may generate first predicted speech data including 'clothes' based on the user's predicted speech.
  • the speech intention prediction unit 230 inputs the first predicted speech data ('clothes') as an input value of the intention prediction algorithm stored in advance to predict the user's predicted speech at the second time point after the first time point as a third probability value. can be calculated as
  • the speech intention prediction unit 230 may calculate the third probability value that the user's predicted utterance at the second point in time is 'song' as 0.3.
  • the utterance intention prediction unit 230 may calculate a third probability value that the predicted utterance of the user at the second time point is 'place' as 0.7.
  • the utterance intention prediction unit 230 may determine the predicted utterance ('place') corresponding to the highest probability value (0.7) among the third probability values as the user's predicted utterance at the second point in time.
  • the speech intention prediction unit 230 may generate second predicted speech data including 'place' based on the user's predicted speech.
  • the speech intention classification unit 220 and the speech intention prediction unit 230 may provide the speech intention data ('weather') and the predicted speech data ('clothes', 'place') to the response generation unit 30.
  • the response text generator 300 may generate a response text in response to a user's requested utterance by inputting utterance intention data and predicted utterance data as input values of a previously stored response algorithm.
  • for example, the response text generation unit 300 may generate, as an output, a response text consisting of 'Today's weather is hot. We recommend wearing thin and long clothes to prevent air-conditioning sickness.'
  • the text-to-speech conversion unit 310 may convert response text ('The weather is hot today. We recommend wearing thin and long clothes to prevent air-conditioning sickness.') into voice data.
  • the text-to-speech conversion unit 310 may transmit the voice data to the artificial intelligence speaker 2.
  • the artificial intelligence speaker 2 can output voice data ('The weather is hot today. We recommend wearing thin and long clothes to prevent air-conditioning sickness') as a response to the user's request.
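  • The FIG. 5 walk-through restated as a feedback chain, reusing the example probability values above; the table-driven stand-in for the intention prediction algorithm is an assumption for illustration.

```python
# Each predicted utterance is fed back to predict the next time point.
PREDICTION_TABLE = {
    "weather": {"clothes": 0.6, "place": 0.4},   # first time point
    "clothes": {"song": 0.3, "place": 0.7},      # second time point
}

def predict_chain(intention: str, steps: int = 2) -> list:
    chain = []
    for _ in range(steps):
        probs = PREDICTION_TABLE.get(intention)
        if probs is None:
            break
        intention = max(probs, key=probs.get)    # highest third probability value
        chain.append(intention)
    return chain

assert predict_chain("weather") == ["clothes", "place"]
```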
  • FIG. 6 is a diagram illustrating a voice processing method based on artificial intelligence according to an embodiment of the present invention.
  • the voice extraction unit may extract first voice data based on the user's requested utterance at any point in time.
  • that is, the voice extraction unit 100 may extract the first voice data based on the user's requested utterance at a certain point in time.
  • the first deep learning unit may determine the gender and age of the user by inputting the first voice data as an input value of the first determination algorithm stored in advance.
  • the first deep learning unit 110 may calculate the gender and age of the user 3 as first probability values by using the first voice data as an input value of the pre-stored first decision algorithm. At this time, the first deep learning unit 110 may determine the gender and age corresponding to the highest probability value among the calculated first probability values as the gender and age of the user 3.
  • the voice recognizer distribution unit may determine one voice recognizer corresponding to the user's gender and age.
  • the voice recognizer distribution unit 120 recognizes the requested utterance of the user 3 from one point to another according to the gender and age of the user 3, and a plurality of different voice recognizers that generate voice data. can include The voice recognizer distribution unit 120 may determine one voice recognizer corresponding to the user's gender and age.
  • any one voice recognizer may extract second voice data based on the user's requested utterance from one point in time to another point in time.
  • the voice recognizer distribution unit 120 determines any one voice recognizer corresponding to the user's gender and age, and that voice recognizer may recognize the requested utterance of the user 3 and extract the second voice data based on it.
  • the voice-to-text converter may convert the second voice data into the first converted text.
  • the voice-to-text conversion unit 200 may convert second voice data extracted from any one voice recognizer included in the voice recognizer distribution unit 120 into first converted text.
  • the first requested text may be generated by inputting the first converted text as the input value of the first deep learning algorithm stored in advance in the first voice model deep learning unit.
  • the first request text may be input as an input value of the intention classification algorithm previously stored in the speech intention classification unit to determine the speech intention and generate speech intention data.
  • the speech intention classification unit 220 may calculate the user's speech intention as a second probability value by inputting the first request text as an input value of the intention classification algorithm stored in advance.
  • the speech intention classification unit 220 may determine the speech intention corresponding to the highest probability value among the calculated second probability values as the user's speech intention.
  • the predicted speech may be determined by inputting speech intention data as an input value of the intention prediction algorithm stored in advance in the speech intention prediction unit, and predicted speech data may be generated.
  • the speech intention prediction unit 230 may calculate the user's predicted speech as a third probability value by inputting speech intention data as an input value of the intention prediction algorithm stored in advance.
  • the utterance intention prediction unit 230 may determine the predicted utterance corresponding to the highest probability value among the calculated third probability values as the user's predicted utterance.
  • a response may be generated by inputting speech intention data and predicted speech data as input values of the response algorithm previously stored in the response text generator.
  • the response text generator 300 may generate a response to the user's requested utterance by inputting utterance intention data and predicted utterance data as input values of a response algorithm stored in advance.
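  • The FIG. 6 method restated as one hedged end-to-end sketch; every field of Units is a stand-in callable for the corresponding numbered unit, and none of these interfaces is disclosed in the document itself.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Units:
    voice_extractor: Callable          # 100: audio -> first voice data
    first_deep_learning: Callable      # 110: first voice data -> (gender, age)
    recognizer_distribution: Callable  # 120: (gender, age) -> recognizer
    voice_to_text: Callable            # 200: second voice data -> first converted text
    first_voice_model: Callable        # 212: first converted text -> first request text
    intent_classifier: Callable        # 220: request text -> speech intention data
    intent_predictor: Callable         # 230: intention data -> predicted speech data
    response_generator: Callable       # 300: (intention, predicted) -> response text
    text_to_speech: Callable           # 310: response text -> voice data

def respond(request_audio, units: Units):
    first_voice = units.voice_extractor(request_audio)
    gender, age = units.first_deep_learning(first_voice)
    recognizer = units.recognizer_distribution(gender, age)
    second_voice = recognizer(request_audio)
    converted_text = units.voice_to_text(second_voice)
    request_text = units.first_voice_model(converted_text)
    intention = units.intent_classifier(request_text)
    predicted = units.intent_predictor(intention)
    response_text = units.response_generator(intention, predicted)
    return units.text_to_speech(response_text)
```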
  • FIG. 7 is a diagram explaining a method of adjusting a first deep learning algorithm according to an embodiment of the present invention.
  • the request text output unit may provide the first converted text and the first request text to the first error rate calculator.
  • the request text output unit 213 may provide the first converted text and the first request text to the first error rate calculator 2120 .
  • the first error rate calculation unit 2120 may calculate a first error rate (or a first loss value) between the first converted text and the first request text.
  • the reference text storage unit 214 may provide the first reference text (or first transcription text) to the second voice model deep learning unit 215.
  • the first reference text (or first transcription text) may be input as an input value of the second deep learning algorithm previously stored in the second voice model deep learning unit 215, and an output value may be output.
  • the second error rate calculation unit 2150 may calculate a second error rate between the first reference text (or first transcription text) and the second reference text (or second transcription text).
  • the second voice model deep learning unit 215 may fine-tune the second deep learning algorithm and perform deep learning so that it outputs a second reference text (or second transcription text) identical to the first reference text (or first transcription text).
  • the second error rate calculation unit 2150 may calculate a second error rate (or second loss value) between the first reference text (or first transcription text) provided from the reference text storage unit 214 and the output value of the pre-stored second deep learning algorithm.
  • the second voice model deep learning unit 215 may fine-tune the second deep learning algorithm using the second error rate (or second loss value) until the output value of the second deep learning algorithm becomes a second reference text (or second transcription text) identical to the first reference text (or first transcription text).
  • the weight value calculation unit 216 may calculate the weight value using the first error rate (or the first loss value) and the second error rate (or the second loss value).
  • that is, the weight value calculation unit 216 may calculate the weight value for fine-tuning the first deep learning algorithm of the first voice model deep learning unit 212 based on the first error rate value (or first loss value) and the second error rate value (or second loss value).
  • the weight value calculation unit may provide the weight values to the first speech model deep learning unit.
  • the first voice model deep learning unit may fine-tune the pre-stored first deep learning algorithm based on the weight value.
  • the weight value provided by the weight value calculation unit 216 is used as the weight value of the first voice model deep learning unit 212 to fine-tune the first deep learning algorithm, and the first voice model deep learning unit 212 may perform deep learning by inputting the first converted text as an input value of the fine-tuned first deep learning algorithm.
  • the embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components.
  • the devices, methods, and components described in the embodiments may be implemented using, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field-programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.
  • a processing device may run an operating system and one or more software applications running on the operating system.
  • a processing device may also access, store, manipulate, process, and generate data in response to execution of software.
  • it will be understood that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements.
  • a processing device may include a plurality of processors or a processor and a controller. Also, other processing configurations are possible, such as a parallel processor.
  • Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may, independently or collectively, command the processing device.
  • Software and/or data may be embodied in any tangible machine, component, physical device, virtual equipment, or computer storage medium or device, so as to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable media.
  • the method according to the embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer readable medium.
  • Computer readable media may include program instructions, data files, data structures, etc. alone or in combination.
  • Program commands recorded on the medium may be specially designed and configured for the embodiment or may be known and usable to those skilled in computer software.
  • Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; and ROM, RAM, and flash memory.
  • the hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

An artificial intelligence-based dialogue situation prediction and intention classification system according to the present invention comprises at least one processor that determines the gender and age of a user on the basis of a user's request utterance uttered from one point in time to another point in time, determines the utterance intention of the user by converting the request utterance of the user to text, predicts an expected utterance of the user after the other point in time, and generates a response to the user's request utterance on the basis of the predicted expected utterance and the utterance intention of the user.

Description

Dialogue situation prediction and intention classification system based on artificial intelligence, and method thereof
The present invention relates to a dialogue situation prediction and intention classification system based on artificial intelligence, and a method thereof.
Recently, with the development of artificial intelligence technology, demand for artificial intelligence speakers has been increasing.
In general, an artificial intelligence speaker may recognize a voice provided by a user and may generate and output a response based on a built-in algorithm. Using an artificial intelligence speaker, a user can conveniently access various kinds of information.
However, in a noisy environment, or due to various external or internal factors, an artificial intelligence speaker may fail to accurately recognize the user's voice and may provide inaccurate information to the user.
In addition, since a response is generated based only on the user's utterance, the information that can be provided to the user may be limited. Accordingly, the user may have to use the artificial intelligence speaker several times to obtain a desired response, which lowers efficiency in terms of information acquisition.
Accordingly, there is a need for a technology that can operate in various environments and efficiently provide a variety of information to users.
The present invention was derived from research conducted as part of the Information, Communication, and Broadcasting Innovation Talent Development (R&D) program of the Ministry of Science and ICT (project identification number: 1711125907, project number: 2020-0-01808-002, research project title: Research on complex-information-based predictive intelligence innovation technology, project management agency: Information and Communications Technology Planning and Evaluation Institute, project executing agency: Kyungpook National University Industry-Academic Cooperation Foundation, research period: 2021.01.01.–2021.12.31.). The Korean government holds no property interest in any aspect of the present invention.
A technical problem to be solved by the present invention is to provide a dialogue situation prediction and intention classification system based on artificial intelligence, and a method thereof, that recognize a user's voice according to the user's gender and age.
Another technical problem to be solved by the present invention is to provide a dialogue situation prediction and intention classification system based on artificial intelligence, and a method thereof, that determine a user's utterance intention based on the user's voice.
Another technical problem to be solved by the present invention is to provide a dialogue situation prediction and intention classification system based on artificial intelligence, and a method thereof, that predict the user's subsequent utterance based on the user's utterance intention.
Another technical problem to be solved by the present invention is to provide a dialogue situation prediction and intention classification system based on artificial intelligence, and a method thereof, that generate a response based on the user's utterance intention and the predicted utterance.
An artificial intelligence-based dialogue situation prediction and intention classification system according to an embodiment of the present invention includes at least one processor, and the at least one processor includes: a voice determination unit that determines the gender and age of a user based on the user's request utterance from one point in time to another point in time; a voice processing unit that converts the user's request utterance into text to determine the user's utterance intention and determines the user's predicted utterance after the other point in time; and a response generation unit that generates a response to the user's request utterance based on the user's utterance intention and the predicted utterance.
In addition, the voice determination unit according to an embodiment of the present invention includes: a voice extraction unit that extracts first voice data based on the request utterance at the one point in time; a first deep learning unit that inputs the first voice data as an input value of a pre-stored first decision algorithm, calculates the user's gender and age as first probability values, and determines the user's gender and age based on the first probability values; and a voice recognizer distribution unit that recognizes the user's request utterance from the one point in time to the other point in time based on the determined gender and age and extracts second voice data.
In addition, the first deep learning unit according to an embodiment of the present invention determines the gender and age corresponding to the highest of the first probability values as the user's gender and age.
In addition, the voice recognizer distribution unit according to an embodiment of the present invention includes a plurality of different voice recognizers that recognize the user's request utterance and generate the second voice data in correspondence with the determined gender and age.
In addition, the voice processing unit according to an embodiment of the present invention includes: a voice-to-text conversion unit that converts the second voice data into first converted text; a second deep learning unit that inputs the first converted text as an input value of a pre-stored first deep learning algorithm and determines first request text corresponding to the user's request utterance; an utterance intention classification unit that inputs the first request text as an input value of a pre-stored intention classification algorithm and generates utterance intention data on the utterance intention; and an utterance intention prediction unit that inputs the utterance intention data as an input value of a pre-stored intention prediction algorithm and generates predicted utterance data on the predicted utterance.
In addition, the second deep learning unit according to an embodiment of the present invention includes: a converted text input unit that receives the first converted text; a first voice model deep learning unit that inputs the first converted text as an input value of the pre-stored first deep learning algorithm and generates the first request text; and a request text output unit that outputs the first request text.
In addition, the second deep learning unit according to an embodiment of the present invention further includes: a reference text storage unit that stores first reference text corresponding to the user's request utterance; and a first error rate calculation unit that calculates a first error rate value between the first converted text and the first request text.
In addition, the second deep learning unit according to an embodiment of the present invention further includes: a second voice model deep learning unit that performs deep learning by inputting the first reference text as an input value of a pre-stored second deep learning algorithm; a second error rate calculation unit that calculates a second error rate value between the first reference text and second reference text, which is an output value of the second voice model deep learning unit; and a weight value calculation unit that calculates a weight value for deep learning of the first voice model deep learning unit based on the first error rate value and the second error rate value.
In addition, the first voice model deep learning unit according to an embodiment of the present invention performs deep learning by using the weight value as the weight value of the pre-stored first deep learning algorithm and inputting the first converted text as an input value.
In addition, the utterance intention classification unit according to an embodiment of the present invention inputs the first request text as an input value of the pre-stored intention classification algorithm and generates utterance intention data in which the utterance intention is calculated as second probability values.
In addition, the utterance intention classification unit according to an embodiment of the present invention determines the utterance intention having the highest of the second probability values as the user's utterance intention.
In addition, the utterance intention prediction unit according to an embodiment of the present invention inputs the utterance intention data as an input value of the pre-stored intention prediction algorithm and generates predicted utterance data in which the predicted utterance is calculated as third probability values.
In addition, the utterance intention prediction unit according to an embodiment of the present invention determines the predicted utterance having the highest of the third probability values as the user's predicted utterance.
In addition, the response generation unit according to an embodiment of the present invention includes: a response text generation unit that generates response text for the user's request utterance by using the utterance intention data and the predicted utterance data as input values of a pre-stored response algorithm; and a text-to-voice conversion unit that converts the response text into voice data.
In addition, in an artificial intelligence-based voice processing method performed by an artificial intelligence-based voice processing system including at least one processor according to an embodiment of the present invention, the at least one processor includes a voice determination unit, a voice processing unit, and a response generation unit, and the method includes: determining, by the voice determination unit, the gender and age of a user based on the user's request utterance from one point in time to another point in time; converting, by the voice processing unit, the user's request utterance into text to determine the user's utterance intention and determining the user's predicted utterance after the other point in time; and generating, by the response generation unit, a response to the user's request utterance based on the user's utterance intention and the predicted utterance.
In addition, an embodiment of the present invention includes a computer-readable non-transitory recording medium on which a program for executing the artificial intelligence-based voice processing system is recorded.
The artificial intelligence-based dialogue situation prediction and intention classification system according to the present invention can accurately recognize a user's voice according to the user's gender and age.
In addition, the system can determine the user's utterance intention based on the user's voice.
In addition, the system can predict the user's subsequent utterance based on the user's utterance intention.
In addition, the system can generate a response based on the user's utterance intention and the predicted utterance.
FIG. 1 is a diagram illustrating an artificial intelligence-based dialogue situation prediction and intention classification system according to an embodiment of the present invention.
FIG. 2 is a diagram explaining a process of determining a user's gender and age and extracting second voice data according to an embodiment of the present invention.
FIG. 3 is a diagram explaining a user's request utterance and first converted text according to an embodiment of the present invention.
FIG. 4 is a diagram explaining a second deep learning unit according to an embodiment of the present invention.
FIG. 5 is a diagram explaining a process of determining a user's utterance intention and predicted utterance and generating a response according to an embodiment of the present invention.
FIG. 6 is a diagram explaining an artificial intelligence-based dialogue situation prediction and intention classification method according to an embodiment of the present invention.
FIG. 7 is a diagram explaining a method of adjusting a first deep learning algorithm according to an embodiment of the present invention.
Hereinafter, various embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice them. The present invention may be embodied in many different forms and is not limited to the embodiments set forth herein.
To clearly describe the present invention, parts irrelevant to the description are omitted, and the same reference numerals are assigned to the same or similar components throughout the specification. Accordingly, reference numerals described earlier may also be used in other drawings.
In addition, since the size and thickness of each component shown in the drawings are arbitrarily represented for convenience of description, the present invention is not necessarily limited to what is shown. In the drawings, thicknesses may be exaggerated to clearly express various layers and regions.
In addition, the expression "the same" in the description may mean "substantially the same", that is, the same to an extent that a person of ordinary skill would accept as the same. Other expressions may likewise be expressions from which "substantially" has been omitted.
In addition, when a part is said to "include" a certain component in the description, this means that it may further include other components rather than excluding them, unless otherwise stated. The term "unit" used in this specification refers to a unit that processes at least one function or operation, and may mean, for example, software, an FPGA, or a hardware component. The functions provided by a "unit" may be performed separately by a plurality of components or may be integrated with other additional components. A "unit" in this specification is not necessarily limited to software or hardware, and may be configured to reside in an addressable storage medium or configured to operate one or more processors. Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram illustrating an artificial intelligence-based dialogue situation prediction and intention classification system according to an embodiment of the present invention.
The artificial intelligence-based dialogue situation prediction and intention classification system 1 according to an embodiment of the present invention includes at least one processor, and the at least one processor may implement or include a voice determination unit 10, a voice processing unit 20, and a response generation unit 30.
The voice determination unit 10 may include a voice extraction unit 100, a first deep learning unit 110, and a voice recognizer distribution unit 120. The voice processing unit 20 may include a voice-to-text conversion unit 200, a second deep learning unit 210, an utterance intention classification unit 220, and an utterance intention prediction unit 230. The response generation unit 30 may include a response text generation unit 300 and a text-to-voice conversion unit 310.
The artificial intelligence speaker 2 may recognize a voice based on an utterance of the user 3 and output a response corresponding to the utterance. Through this, the user 3 can obtain various information from the artificial intelligence speaker 2. Hereinafter, an utterance of the user 3 input to the artificial intelligence speaker 2 will be referred to as a 'request utterance'.
The voice determination unit 10 may recognize the request utterance of the user 3 from one point in time to another point in time.
Hereinafter, in FIG. 1, it is assumed that the user 3 starts the request utterance at one point in time and utters the request 'Air, tell me today's weather' to the artificial intelligence speaker 2 from the one point in time to the other point in time. Here, 'Air' is assumed to be the initial command for activating the artificial intelligence speaker 2.
The voice extraction unit 100 may extract voice data at the one point in time based on the request utterance of the user 3. Hereinafter, the voice data of the user 3 extracted by the voice extraction unit 100 at the one point in time will be referred to as first voice data.
Specifically, the voice extraction unit 100 may extract first voice data including 'Air', the request utterance of the user 3 at the one point in time.
The first deep learning unit 110 may calculate the gender and age of the user 3 as probability values by using the first voice data as an input value of the pre-stored first decision algorithm, and may determine the user's gender and age based on the probability values.
Specifically, the first deep learning unit 110 may input the first voice data as an input value of the pre-stored first decision algorithm and calculate, as probability values, the probability that the gender of the user 3 is male or female and that the age is adult, elderly, or child.
Hereinafter, the probability values calculated in the first deep learning unit 110 by inputting the first voice data as an input value of the pre-stored first decision algorithm will be referred to as first probability values.
At this time, the first deep learning unit 110 may determine the gender and age corresponding to the highest of the calculated first probability values as the gender and age of the user 3.
The process in which the first deep learning unit 110 determines the gender and age of the user 3 by inputting the first voice data as an input value of the pre-stored decision algorithm will be described in detail with reference to FIG. 2 below.
The voice recognizer distribution unit 120 may include a plurality of different voice recognizers that recognize the request utterance of the user 3 from the one point in time to the other point in time according to the gender and age of the user 3 and generate voice data based on the request utterance.
The voice recognizer distribution unit 120 may determine one voice recognizer corresponding to the user's gender and age, and may recognize the user's request utterance from the one point in time to the other point in time using that voice recognizer.
The selected voice recognizer may recognize the user's request utterance and extract voice data based on it.
Hereinafter, the voice data recognized and extracted by the voice recognizer selected in the voice recognizer distribution unit 120 will be referred to as second voice data.
Specifically, when the first deep learning unit 110 determines that the user 3 is an adult male, the voice recognizer distribution unit 120 may determine the voice recognizer for recognizing an adult male's request utterance, and that recognizer may recognize the adult male's request utterance and generate second voice data.
When the first deep learning unit 110 determines that the user 3 is an adult female, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing an adult female's request utterance, and that recognizer may recognize the adult female's request utterance and generate second voice data.
When the first deep learning unit 110 determines that the user 3 is an elderly male, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing an elderly male's request utterance, and that recognizer may recognize the elderly male's request utterance and generate second voice data.
When the first deep learning unit 110 determines that the user 3 is an elderly female, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing an elderly female's request utterance, and that recognizer may recognize the elderly female's request utterance and generate second voice data.
When the first deep learning unit 110 determines that the user 3 is a male child, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing a male child's request utterance, and that recognizer may recognize the male child's request utterance and generate second voice data.
When the first deep learning unit 110 determines that the user 3 is a female child, the voice recognizer distribution unit 120 may select the voice recognizer for recognizing a female child's request utterance, and that recognizer may recognize the female child's request utterance and generate second voice data.
That is, the voice recognizer distribution unit 120 may determine one voice recognizer for recognizing the request utterance of the user 3, who corresponds to one of male or female and to one of elderly, adult, or child.
In addition, the selected voice recognizer may recognize the request utterance of the user 3 from the one point in time to the other point in time and extract second voice data based on it. This demographic dispatch can be sketched as shown below.
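For illustration only, the dispatch described above can be viewed as a lookup from a (gender, age group) pair to a dedicated recognizer; a minimal Python sketch follows. The make_recognizer and dispatch names and the byte-level interface are assumptions of this sketch, not part of the disclosure.

from typing import Callable, Dict, Tuple

# One recognizer per (gender, age-group) pair, mirroring the six recognizers
# held by the voice recognizer distribution unit 120.
Recognizer = Callable[[bytes], bytes]  # utterance audio in, second voice data out

def make_recognizer(label: str) -> Recognizer:
    def recognize(audio: bytes) -> bytes:
        # A real recognizer would run an acoustic/language model tuned to `label`.
        print(f"recognizing with {label} model")
        return audio
    return recognize

RECOGNIZERS: Dict[Tuple[str, str], Recognizer] = {
    (gender, age): make_recognizer(f"{age}-{gender}")
    for gender in ("male", "female")
    for age in ("adult", "elderly", "child")
}

def dispatch(gender: str, age_group: str, audio: bytes) -> bytes:
    # The distribution unit selects the recognizer matching the decided demographics.
    return RECOGNIZERS[(gender, age_group)](audio)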
The voice processing unit 20 may determine the user's utterance intention by converting the user's request utterance from the one point in time to the other point in time into text, and may determine the user's predicted utterance after the other point in time.
The voice-to-text conversion unit 200 may convert the second voice data extracted by the voice recognizer selected in the voice recognizer distribution unit 120 into text. At this time, the text may contain errors. Hereinafter, the text converted by the voice-to-text conversion unit 200, which may contain errors, will be referred to as first converted text.
Specifically, a voice recognizer may misrecognize the request utterance of the user 3 due to various external or internal factors, such as external noise picked up while recognizing the request utterance.
For example, the voice recognizer that recognizes an adult male's request utterance may misrecognize the adult male's request utterance 'Air, tell me today's weather' as 'Air, tell me the schedule for May', and may extract second voice data consisting of 'Air, tell me the schedule for May' based on it.
At this time, the voice-to-text conversion unit 200 may convert the second voice data into first converted text, and the first converted text may consist of the erroneous 'Air, tell me the schedule for May'.
The second deep learning unit 210 may generate first request text by inputting the first converted text as an input value of the pre-stored first deep learning algorithm.
Specifically, when the first converted text is input as an input value of the first deep learning algorithm pre-stored in the second deep learning unit 210, the first deep learning algorithm may generate, as an output value, request text corresponding to the user's utterance request.
Hereinafter, the request text generated by the first deep learning algorithm pre-stored in the second deep learning unit 210 will be referred to as first request text.
The second deep learning unit 210 is composed of an input layer, a hidden layer, and an output layer, and the first deep learning algorithm may refer to the series of processes in which the first converted text is input as an input value of the input layer and the first request text is output from the output layer.
At this time, the first converted text input to the input layer may be converted into the first request text, the output value, by applying predetermined weight values in the hidden layer.
The predetermined weight values applied in the hidden layer of the second deep learning unit 210 may be reset by a weight value reflecting the first error rate and the second error rate described with reference to FIG. 4 below, and the first deep learning algorithm can be adjusted through this process. A schematic sketch of this layered mapping is given below.
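As a schematic sketch only (the patent does not disclose the network architecture), the input layer / hidden layer / output layer pipeline above can be illustrated with a tiny feed-forward pass in NumPy. Text is stood in for by a fixed-size feature vector, and all dimensions are assumptions.

import numpy as np

rng = np.random.default_rng(0)

IN_DIM, HID_DIM, OUT_DIM = 16, 32, 16
W1 = rng.normal(size=(IN_DIM, HID_DIM))   # input -> hidden weights
W2 = rng.normal(size=(HID_DIM, OUT_DIM))  # hidden -> output weights

def forward(x: np.ndarray) -> np.ndarray:
    # The hidden layer applies the "predetermined weight values" described above.
    h = np.tanh(x @ W1)
    # The output layer produces the corrected representation (the first request text).
    return h @ W2

x = rng.normal(size=IN_DIM)  # stand-in for an encoded first converted text
y = forward(x)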
The second deep learning unit 210 may calculate an error rate (or loss value) between the first converted text and the first request text output from the first deep learning algorithm.
Hereinafter, the error rate (or loss value) calculated in the second deep learning unit 210 between the first converted text and the first request text output from the first deep learning algorithm will be referred to as a first error rate (or first loss value).
Specifically, when the first converted text is input as an input value of the first deep learning algorithm of the second deep learning unit 210, the first request text output from the first deep learning algorithm may not correspond exactly to the user's utterance request.
Therefore, in order to fine-tune the first deep learning algorithm as described later, the second deep learning unit 210 may calculate the first error rate (or first loss value) between the first request text output from the first deep learning algorithm and the first converted text input as the input value of the first deep learning algorithm. One common way to compute such a text error rate is sketched below.
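The disclosure does not specify the error-rate metric; as one common choice, a character error rate based on edit distance could serve, as in this hedged sketch.

def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def char_error_rate(reference: str, hypothesis: str) -> float:
    # Edits needed to turn one text into the other, normalized by reference length.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# e.g. a first error rate between the first converted text and the first request text
print(char_error_rate("Air, tell me the schedule for May", "Air, tell me today's weather"))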
The second deep learning unit 210 may fine-tune the pre-stored second deep learning algorithm so that, when the first reference text (or first transcription text) is input as its input value, it outputs second reference text (or second transcription text) identical to the first reference text. Here, the first reference text (or transcription text) is the text that corresponds exactly to the user's utterance intention and, free of errors, reads 'Air, tell me today's weather'.
Specifically, the second deep learning unit 210 may calculate a second error rate (or second loss value) between the first reference text (or first transcription text) and the second reference text (or second transcription text) output as the output value of the pre-stored second deep learning algorithm.
In addition, the second deep learning unit 210 may fine-tune the second deep learning algorithm based on the second error rate (or second loss value) and perform deep learning.
The second deep learning unit 210 may calculate a weight value using the first error rate (or first loss value) and the second error rate (or second loss value), and may fine-tune the pre-stored first deep learning algorithm using the calculated weight value. An illustrative combination, under stated assumptions, is sketched below.
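The exact function that maps the two error rates to a weight value is not disclosed; purely as an illustration, one might blend them convexly, with the coefficient alpha being an assumption of this sketch.

def combined_weight(first_error_rate: float, second_error_rate: float,
                    alpha: float = 0.5) -> float:
    # Hypothetical combination: a convex blend of the two error rates.
    # The disclosure only states that both rates feed the weight value.
    return alpha * first_error_rate + (1.0 - alpha) * second_error_rate

# The resulting value could, for example, scale a gradient step when
# fine-tuning the hidden-layer weights of the first deep learning algorithm.
learning_rate = 0.01
step_scale = learning_rate * combined_weight(0.21, 0.05)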
Through the above process, even in a noisy environment or when a voice recognizer misrecognizes the user's utterance, the first deep learning algorithm of the second deep learning unit 210 can accurately recognize the user's request utterance and convert it into text (the first request text).
The utterance intention classification unit 220 may generate utterance intention data on the user's utterance intention by inputting the first request text as an input value of the pre-stored intention classification algorithm.
Specifically, the utterance intention classification unit 220 may calculate the user's utterance intention as probability values by inputting the first request text, 'Air, tell me today's weather', as an input value of the pre-stored intention classification algorithm.
Hereinafter, the probability values calculated by the utterance intention classification unit 220 will be referred to as second probability values.
At this time, the utterance intention classification unit 220 may determine the utterance intention corresponding to the highest of the calculated second probability values as the user's utterance intention. For example, the utterance intention classification unit 220 may determine that the user's utterance intention concerns the 'weather' of 'today'. This highest-probability selection can be sketched as follows.
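For illustration, selecting the highest second probability value reduces to an argmax over intent labels; the label set and scores below are invented for this sketch, since the patent does not disclose them.

# Hypothetical output of the intention classification algorithm.
intent_scores = {
    ("today", "weather"): 0.87,
    ("today", "schedule"): 0.08,
    ("music", "play"): 0.05,
}

# The classification unit keeps the intention with the highest second probability value.
best_intent = max(intent_scores, key=intent_scores.get)
print(best_intent)  # ('today', 'weather')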
The process in which the utterance intention classification unit 220 calculates the user's utterance intention as probability values by using the first request text as an input value of the pre-stored intention classification algorithm, and determines the user's utterance intention based on them, will be described in detail with reference to FIG. 5 below.
The utterance intention prediction unit 230 may generate predicted utterance data on the user's predicted utterance by inputting the utterance intention data as an input value of the pre-stored intention prediction algorithm.
Specifically, the utterance intention prediction unit 230 may calculate the user's predicted utterance as probability values by inputting the utterance intention data, 'today' and 'weather', as input values of the pre-stored intention prediction algorithm. Hereinafter, the probability values calculated by the utterance intention prediction unit 230 will be referred to as third probability values.
At this time, the utterance intention prediction unit 230 may determine the predicted utterance corresponding to the highest of the calculated third probability values as the user's predicted utterance.
For example, the utterance intention prediction unit 230 may determine the user's predicted utterance at a first point in time, after the other point in time at which the request utterance of the user 3 ended, to be 'clothes', and may determine the user's predicted utterance at a second point in time, after the first point in time, to be 'place'. A sketch of such stepwise prediction follows.
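As a hedged sketch of the prediction step, the classified intent can index a table of likely follow-up topics, applied once per future time point; the transition table here is invented for illustration.

# Hypothetical follow-up table for the intention prediction algorithm.
FOLLOW_UPS = {
    ("today", "weather"): [("clothes", 0.62), ("place", 0.25), ("music", 0.13)],
    ("clothes",): [("place", 0.70), ("shopping", 0.30)],
}

def predict_next(context: tuple) -> str:
    # Keep the candidate with the highest third probability value.
    return max(FOLLOW_UPS[context], key=lambda t: t[1])[0]

first = predict_next(("today", "weather"))  # -> 'clothes' (first point in time)
second = predict_next((first,))             # -> 'place' (second point in time)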
The process in which the utterance intention prediction unit 230 calculates the user's predicted utterance as probability values by using the utterance intention data as an input value of the pre-stored intention prediction algorithm, and determines the user's predicted utterance based on them, will be described in detail with reference to FIG. 5 below.
That is, as described above, the utterance intention classification unit 220 may determine the user's utterance intention from the one point in time to the other point in time based on the first request text and generate utterance intention data, and the utterance intention prediction unit 230 may determine the user's predicted utterance after the other point in time, at which the utterance request of the user 3 ended, based on the utterance intention data and generate predicted utterance data.
The response generation unit 30 may generate a response to the user's request utterance based on the user's utterance intention determined by the utterance intention classification unit 220 and the user's predicted utterance determined by the utterance intention prediction unit 230.
Specifically, the response text generation unit 300 may generate a response to the user's request utterance by inputting the utterance intention data and the predicted utterance data as input values of the pre-stored response algorithm.
For example, since the user's utterance intention is determined to concern the 'weather' of 'today' and the user's predicted utterances are determined to concern 'clothes' and 'place', the response text generation unit 300 may input the utterance intention data including 'today' and 'weather' and the predicted utterance data including 'clothes' and 'place' as input values of the pre-stored response algorithm.
Based on this, the pre-stored response algorithm may generate, as an output value, the response text 'Today's weather is hot. To avoid air-conditioning sickness, we recommend wearing thin, long clothes when you go outside.'
That is, the response text generation unit 300 may generate the response text 'Today's weather is hot. To avoid air-conditioning sickness, we recommend wearing thin, long clothes when you go outside.'
The text-to-voice conversion unit 310 may convert the response text generated by the response text generation unit 300 into voice data, transmit the voice data to the artificial intelligence speaker 2, and output it as a response to the request utterance of the user 3. A template-style sketch of the response step is shown below.
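The response algorithm itself is not disclosed; as a stand-in, a simple template keyed on the intent and predicted topics can illustrate how the intention data and predicted utterance data combine into response text. The function name and the weather value are assumptions of this sketch.

def build_response(intent: tuple, predicted: list, weather: str = "hot") -> str:
    # Combine the utterance intention data with the predicted utterance data.
    if intent == ("today", "weather") and "clothes" in predicted:
        return (f"Today's weather is {weather}. To avoid air-conditioning sickness, "
                "we recommend wearing thin, long clothes when you go outside.")
    return "Sorry, I did not understand the request."

text = build_response(("today", "weather"), ["clothes", "place"])
# `text` would then be handed to the text-to-voice conversion unit 310 for playback.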
FIG. 2 is a diagram explaining a process of determining a user's gender and age and extracting second voice data according to an embodiment of the present invention.
The voice extraction unit 100 may extract the first voice data ('Air') at the one point in time based on the request utterance of the user 3, and the first deep learning unit 110 may calculate the gender and age of the user 3 as probability values by using the first voice data ('Air') as an input value of the pre-stored first decision algorithm.
Hereinafter, in FIG. 2, it is assumed that the voice extraction unit 100 extracts the first voice data ('Air') based on the request utterance of the user 3 and that the first voice data ('Air') is input to the first deep learning unit 110 as an input value of the pre-stored first decision algorithm.
When the first voice data ('Air') is input as an input value of the pre-stored first decision algorithm, the algorithm may calculate, as an output value, a first probability value of 0.9 that the user 3 is an adult male. That is, the first deep learning unit 110 may determine the first probability value that the user 3 is an adult male to be 0.9.
Likewise, the first deep learning unit 110 may calculate the first probability value that the user 3 is an adult female as 0.02.
The first deep learning unit 110 may calculate the first probability value that the user 3 is an elderly male as 0.03.
The first deep learning unit 110 may calculate the first probability value that the user 3 is an elderly female as 0.02.
The first deep learning unit 110 may calculate the first probability value that the user 3 is a male child as 0.02.
The first deep learning unit 110 may calculate the first probability value that the user 3 is a female child as 0.01.
At this time, the first deep learning unit 110 may determine the gender (male) and age (adult) corresponding to the highest first probability value (0.9) as the gender and age of the user 3, as in the sketch below.
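Using the example numbers above, the decision reduces to an argmax over the six (gender, age) classes; the dictionary keys are an assumed labeling of the decision algorithm's outputs.

# The six first probability values from the example above.
first_probs = {
    ("male", "adult"): 0.90, ("female", "adult"): 0.02,
    ("male", "elderly"): 0.03, ("female", "elderly"): 0.02,
    ("male", "child"): 0.02, ("female", "child"): 0.01,
}

# The first deep learning unit keeps the highest-probability (gender, age) pair.
gender, age = max(first_probs, key=first_probs.get)
print(gender, age)  # male adult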
The voice recognizer distribution unit 120 may determine the voice recognizer for recognizing an adult male's request utterance, and the voice recognizer selected by the voice recognizer distribution unit 120 may recognize the user's request utterance from the one point in time to the other point in time and extract second voice data based on it.
That is, when the first deep learning unit 110 determines that the user 3 is an adult male, the selected voice recognizer may recognize the request utterance of the adult male user 3 and extract second voice data based on it.
FIG. 3 is a diagram explaining a user's request utterance and first converted text according to an embodiment of the present invention.
Based on the gender and age of the user 3, the voice recognizer distribution unit 120 may determine the voice recognizer for recognizing the request utterance of the user 3, and that voice recognizer may recognize the request utterance of the user 3 and convert it into first converted text.
At this time, the voice recognizer may misrecognize the request utterance of the user 3 due to various external or internal factors, such as external noise picked up while recognizing the request utterance.
Referring to FIG. 3, when the request utterance of the adult male user 3 is 'Air, tell me today's weather', the voice recognizer may misrecognize it as 'Air, book the schedule for May'.
Based on this, the voice recognizer may extract second voice data including 'Air, book the schedule for May' and convert the second voice data into first converted text consisting of 'Air, book the schedule for May'.
Alternatively, when the request utterance of the adult female user 3 is 'Air, call my daughter Jinhee', the voice recognizer may misrecognize it as 'Air, ## tell Genie'.
Based on this, the voice recognizer may extract second voice data including 'Air, ## tell Genie' and convert the second voice data into first converted text consisting of 'Air, ## tell Genie'.
Alternatively, when the request utterance of the male child user 3 is 'Air, play the Pororo rerun at 6 o'clock', the voice recognizer may misrecognize it as 'Air, play the lottery game at 6 o'clock'.
Based on this, the voice recognizer may extract second voice data including 'Air, play the lottery game at 6 o'clock' and convert the second voice data into first converted text consisting of 'Air, play the lottery game at 6 o'clock'.
Alternatively, when the request utterance of the female child user 3 is 'Air, add this to my English-study YouTube playlist', the voice recognizer may misrecognize it as 'Air, ## don't play the study YouTube'.
Based on this, the voice recognizer may extract second voice data including 'Air, ## don't play the study YouTube' and convert the second voice data into first converted text consisting of 'Air, ## don't play the study YouTube'.
As described above, a voice recognizer may misrecognize the request utterance of the user 3 due to various external or internal factors, such as external noise picked up during recognition, and may extract second voice data and generate first converted text based on the misrecognized utterance.
FIG. 4 is a diagram illustrating the second deep learning unit according to an embodiment of the present invention.
Hereinafter, in FIGS. 4 and 5, it is assumed that the gender and age of the user 3 are male and adult, and that the requested utterance of the user 3 is 'Air, tell me the weather today'.
The second deep learning unit 210 may include a converted text input unit 211, a first voice model deep learning unit 212, a request text output unit 213, a reference text storage unit 214, a second voice model deep learning unit 215, a weight value calculation unit 216, a first error rate calculation unit 2120, and a second error rate calculation unit 2150.
The converted text input unit 211 may receive the first converted text converted by any one of the voice recognizers.
Specifically, the converted text input unit 211 may receive the first converted text consisting of 'Air, reserve a schedule for May'.
In the first voice model deep learning unit 212, the first converted text may be input as an input value of a first deep learning algorithm stored in advance. The first deep learning algorithm stored in advance in the first voice model deep learning unit 212 may generate a first request text as an output value.
The first voice model deep learning unit 212 may adjust the first deep learning algorithm using a weight value provided by the weight value calculation unit 216, which will be described below. The first voice model deep learning unit 212 may perform deep learning by inputting the first converted text as an input value of the adjusted first deep learning algorithm and outputting the first request text.
The first error rate calculation unit 2120 may calculate a first error rate (or first loss value) between the first converted text input to the first voice model deep learning unit 212 and the first request text output from the first deep learning algorithm. The first error rate calculation unit 2120 may provide the first error rate (or first loss value) to the weight value calculation unit 216.
The request text output unit 213 may provide the first request text ('Air, tell me the weather today') generated by the fine-tuned first deep learning algorithm stored in advance in the first voice model deep learning unit 212 to the response generator 30 (see FIG. 1).
The reference text storage unit 214 may store a first reference text (or first transcription text) in advance.
Specifically, the reference text storage unit 214 may store in advance a first reference text (or first transcription text) consisting of 'Air, tell me the weather today', which corresponds to the user's requested utterance and contains no error.
The second voice model deep learning unit 215 may receive the first reference text (or first transcription text) from the reference text storage unit 214. The second voice model deep learning unit 215 may input the first reference text (or first transcription text) as an input value of a second deep learning algorithm stored in advance, and may fine-tune the second deep learning algorithm and perform deep learning so that it outputs a second reference text (or second transcription text) identical to the first reference text (or first transcription text).
Specifically, the second voice model deep learning unit 215 may perform deep learning by using the first reference text (or first transcription text) provided from the reference text storage unit 214 as the input value of the second deep learning algorithm stored in advance, and the second reference text (or second transcription text), identical to the first reference text (or first transcription text), as the output value of that algorithm.
The second error rate calculation unit 2150 may calculate a second error rate between the first reference text (or first transcription text) input to the second deep learning algorithm stored in advance in the second voice model deep learning unit 215 and the second reference text (or second transcription text) output from that algorithm.
The second error rate calculation unit 2150 may provide the second error rate to the weight value calculation unit 216.
The weight value calculation unit 216 may calculate a weight value for fine-tuning the first deep learning algorithm of the first voice model deep learning unit 212 based on the first error rate value and the second error rate value.
Specifically, the weight value may be expressed as [Equation 1] below:
[Equation 1] weight value = a × (first error rate value) + b × (second error rate value)
where a and b are each 0.5.
The weight value calculation unit 216 may provide the weight value to the first voice model deep learning unit 212. The weight value provided by the weight value calculation unit 216 may be determined as the weight value of the first voice model deep learning unit 212 to adjust the first deep learning algorithm, and the first converted text may be input as an input value of the adjusted first deep learning algorithm so that the first voice model deep learning unit 212 performs deep learning.
That is, the weight value provided from the weight value calculation unit 216 may be determined as the weight values added in the hidden layer. Through this process, the first deep learning algorithm can be fine-tuned.
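A minimal sketch of [Equation 1] follows. The constants a = b = 0.5 come from the text above; everything else (the example error rates and the additive hidden-layer adjustment) is an assumed illustration, not the patent's implementation.

```python
# Sketch of [Equation 1]: weight value = a * (first error rate) + b * (second error rate),
# with a = b = 0.5 as stated above. How the scalar enters the hidden layer is an
# assumption here: it is modeled as a value added to each existing hidden-layer weight.

A, B = 0.5, 0.5

def weight_value(first_error_rate: float, second_error_rate: float) -> float:
    return A * first_error_rate + B * second_error_rate

def adjust_hidden_layer(weights: list[float], w: float) -> list[float]:
    # Assumed interpretation of "weight values added in the hidden layer".
    return [v + w for v in weights]

w = weight_value(0.3, 0.1)   # e.g. 30% error on the converted text, 10% on the reference text
print(round(w, 3))           # 0.2
print(adjust_hidden_layer([0.5, -0.2, 0.9], w))
```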
As described above with reference to FIG. 4, the first voice model deep learning unit 212 according to an embodiment of the present invention may input the error-containing first converted text as an input value of the first deep learning algorithm stored in advance, and determine a first request text from which the error has been removed and which corresponds exactly to the user's requested utterance.
In addition, the first voice model deep learning unit 212 according to an embodiment of the present invention may fine-tune the first deep learning algorithm by using, as its weight value, a weight value that combines the first error rate value (or first loss value) calculated between the error-containing first converted text and the first request text with the second error rate value calculated by using the first reference text (or first transcription text) as an input value and the second reference text (or second transcription text) as an output value.
In addition, the first voice model deep learning unit 212 may input the first converted text as an input value of the adjusted first deep learning algorithm and generate a first request text that corresponds exactly to the user's requested utterance.
Through this, the accuracy of the first deep learning algorithm can be increased, and even if a first converted text containing an error is input, the error is removed and a first request text that accurately corresponds to the user's requested utterance can be determined.
FIG. 5 is a diagram illustrating a process of determining a user's utterance intention and predicted utterance and generating a response according to an embodiment of the present invention.
The utterance intention classification unit 220 may receive the first request text corresponding to the user's requested utterance. The utterance intention classification unit 220 may input the first request text as an input value of an intention classification algorithm stored in advance to calculate the user's utterance intention as second probability values, and may determine the utterance intention underlying the user's utterance request based on the second probability values.
Referring to FIG. 5, the utterance intention classification unit 220 may input the first request text ('Air, tell me the weather today') as an input value of the intention classification algorithm stored in advance to calculate the user's utterance intention as second probability values.
Specifically, when the first request text is input as an input value of the intention classification algorithm stored in advance, the utterance intention classification unit 220 may calculate the second probability value that the user's utterance intention is 'weather' as 0.9, and the second probability value that the user's utterance intention is 'clothes' as 0.1.
The utterance intention classification unit 220 may determine the utterance intention ('weather') corresponding to the highest probability value (0.9) among the second probability values as the utterance intention underlying the user's utterance request, and may generate utterance intention data including 'weather' based on the user's utterance intention.
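A minimal sketch of this argmax selection, with the classification algorithm itself reduced to a placeholder returning the probabilities from the example above:

```python
# Minimal sketch of the intention classification step: the pre-stored algorithm is
# treated as a black box returning class probabilities (the 0.9 / 0.1 figures mirror
# the example above); only the highest-probability selection is shown concretely.

def classify_intention(request_text: str) -> dict[str, float]:
    # Placeholder for the pre-stored intention classification algorithm.
    return {"weather": 0.9, "clothes": 0.1}

probs = classify_intention("Air, tell me the weather today")
intention = max(probs, key=probs.get)   # highest second probability value wins
intention_data = {"intention": intention}
print(intention_data)                   # {'intention': 'weather'}
```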
The utterance intention prediction unit 230 may receive the utterance intention data corresponding to the user's utterance intention. The utterance intention prediction unit 230 may input the utterance intention data as an input value of an intention prediction algorithm stored in advance to calculate the user's predicted utterance as third probability values, and may determine the user's predicted utterance based on the third probability values.
Referring to FIG. 5, the utterance intention prediction unit 230 may input the utterance intention data ('weather') as an input value of the intention prediction algorithm stored in advance to calculate, as third probability values, the user's predicted utterance at a first time point after the next time point.
Specifically, when the utterance intention data is input as an input value of the intention prediction algorithm stored in advance, the utterance intention prediction unit 230 may calculate the third probability value that the user's predicted utterance at the first time point is 'clothes' as 0.6, and the third probability value that it is 'place' as 0.4.
The utterance intention prediction unit 230 may determine the predicted utterance ('clothes') corresponding to the highest probability value (0.6) among the third probability values as the user's predicted utterance at the first time point, and may generate first predicted utterance data including 'clothes' based on the user's predicted utterance.
In addition, the utterance intention prediction unit 230 may input the first predicted utterance data ('clothes') as an input value of the intention prediction algorithm stored in advance to calculate, as third probability values, the user's predicted utterance at a second time point after the first time point.
Specifically, when the first predicted utterance data is input as an input value of the intention prediction algorithm stored in advance, the utterance intention prediction unit 230 may calculate the third probability value that the user's predicted utterance at the second time point is 'song' as 0.3, and the third probability value that it is 'place' as 0.7.
The utterance intention prediction unit 230 may determine the predicted utterance ('place') corresponding to the highest probability value (0.7) among the third probability values as the user's predicted utterance at the second time point, and may generate second predicted utterance data including 'place' based on the user's predicted utterance.
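The two prediction steps chain naturally: each predicted utterance is fed back as the next input. A minimal sketch follows, with assumed stand-in probability tables mirroring the figures above:

```python
# Minimal sketch of the chained prediction step: the output at each time point is
# fed back as the input for the next one. The transition tables are stand-ins for
# the pre-stored intention prediction algorithm, mirroring the example figures.

TRANSITIONS = {
    "weather": {"clothes": 0.6, "place": 0.4},   # first time point
    "clothes": {"song": 0.3, "place": 0.7},      # second time point
}

def predict_chain(intention: str, steps: int = 2) -> list[str]:
    current, predicted = intention, []
    for _ in range(steps):
        probs = TRANSITIONS.get(current, {})
        if not probs:
            break
        current = max(probs, key=probs.get)  # highest third probability value
        predicted.append(current)
    return predicted

print(predict_chain("weather"))  # ['clothes', 'place']
```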
The utterance intention classification unit 220 may provide the utterance intention data ('weather') and the predicted utterance data ('clothes' and 'place') to the response generator 30.
The response text generation unit 300 may generate a response text for the user's requested utterance by inputting the utterance intention data and the predicted utterance data as input values of a response algorithm stored in advance.
Specifically, when the utterance intention data ('weather') and the predicted utterance data ('clothes' and 'place') are input as input values of the response algorithm stored in advance, a response text consisting of 'The weather is hot today. To prevent air-conditioning sickness, thin and long clothing is recommended.' may be generated as the output.
That is, the response text generation unit 300 may generate a response text consisting of 'The weather is hot today. To prevent air-conditioning sickness, thin and long clothing is recommended.'
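A minimal sketch of this response step, under the assumption (not stated in the patent) that the stored response algorithm can be approximated by a template lookup keyed on the intention and the first predicted utterance:

```python
# Minimal sketch of response generation: the pre-stored response algorithm is
# approximated here by a template lookup; a production system would presumably
# use a learned generator instead.

RESPONSE_TEMPLATES = {
    ("weather", "clothes"): ("The weather is hot today. To prevent air-conditioning "
                             "sickness, thin and long clothing is recommended."),
}

def generate_response(intention: str, predictions: list[str]) -> str:
    key = (intention, predictions[0]) if predictions else (intention, "")
    return RESPONSE_TEMPLATES.get(key, "Sorry, I did not understand the request.")

print(generate_response("weather", ["clothes", "place"]))
```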
The text-to-speech conversion unit 310 may convert the response text ('The weather is hot today. To prevent air-conditioning sickness, thin and long clothing is recommended.') into voice data, and may transmit the voice data to the artificial intelligence speaker 2.
The artificial intelligence speaker 2 may output the voice data ('The weather is hot today. To prevent air-conditioning sickness, thin and long clothing is recommended.') as the response to the user's requested utterance.
FIG. 6 is a diagram illustrating an artificial intelligence-based voice processing method according to an embodiment of the present invention.
In step S10, the voice extraction unit may extract first voice data based on the user's requested utterance at one time point.
Specifically, when the user 3 makes a requested utterance to the artificial intelligence speaker 2 from one time point to another time point, the voice extraction unit 100 may extract the first voice data based on the user's requested utterance at the one time point.
In step S11, the first deep learning unit may input the first voice data as an input value of a first determination algorithm stored in advance to determine the user's gender and age.
Specifically, the first deep learning unit 110 may calculate the gender and age of the user 3 as first probability values by using the first voice data as an input value of the first determination algorithm stored in advance. At this time, the first deep learning unit 110 may determine the gender and age corresponding to the highest probability value among the calculated first probability values as the gender and age of the user 3.
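A minimal sketch of step S11, with the first determination algorithm reduced to a placeholder over joint (gender, age) classes; the class set, the features, and the probabilities are assumptions:

```python
# Minimal sketch of step S11: a classifier over joint (gender, age) classes.
# Feature extraction and model weights are placeholders; the patent specifies
# only that the highest first probability value decides the class.

CLASSES = [("male", "adult"), ("female", "adult"), ("male", "child"), ("female", "child")]

def first_determination_algorithm(voice_data: bytes) -> list[float]:
    # Placeholder for the pre-stored first determination algorithm.
    return [0.7, 0.1, 0.15, 0.05]

def determine_gender_age(voice_data: bytes) -> tuple[str, str]:
    probs = first_determination_algorithm(voice_data)
    return CLASSES[probs.index(max(probs))]  # argmax over first probability values

print(determine_gender_age(b"\x00\x01"))  # ('male', 'adult')
```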
In step S12, the voice recognizer distribution unit may determine one voice recognizer corresponding to the user's gender and age.
Specifically, the voice recognizer distribution unit 120 may include a plurality of different voice recognizers that recognize the requested utterance of the user 3 from one time point to another time point according to the gender and age of the user 3 and generate voice data. The voice recognizer distribution unit 120 may determine the one voice recognizer corresponding to the user's gender and age.
In step S13, the one voice recognizer may extract second voice data based on the user's requested utterance from the one time point to the other time point.
Specifically, the requested utterance of the user 3 from the one time point to the other time point may be recognized using the one voice recognizer corresponding to the user's gender and age. The one voice recognizer may recognize the requested utterance of the user 3 and extract the second voice data based on it.
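Steps S12 and S13 amount to a dispatch table from (gender, age) to a dedicated recognizer. A minimal sketch, with placeholder recognizer functions standing in for demographic-specific ASR models:

```python
# Minimal sketch of steps S12-S13: a dispatch table mapping (gender, age) to a
# dedicated recognizer. The recognizer functions are placeholders; in practice
# each would be an ASR model trained on speech from that demographic group.

def adult_male_asr(audio: bytes) -> str: return "transcript (adult male model)"
def adult_female_asr(audio: bytes) -> str: return "transcript (adult female model)"
def child_asr(audio: bytes) -> str: return "transcript (child model)"

RECOGNIZERS = {
    ("male", "adult"): adult_male_asr,
    ("female", "adult"): adult_female_asr,
    ("male", "child"): child_asr,
    ("female", "child"): child_asr,
}

def recognize(audio: bytes, gender: str, age: str) -> str:
    recognizer = RECOGNIZERS[(gender, age)]  # one recognizer per demographic group
    return recognizer(audio)                 # yields the recognized second voice data

print(recognize(b"\x00\x01", "male", "adult"))
```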
In step S14, the voice-to-text conversion unit may convert the second voice data into the first converted text.
Specifically, the voice-to-text conversion unit 200 may convert the second voice data extracted by the one voice recognizer included in the voice recognizer distribution unit 120 into the first converted text.
In step S15, the first converted text may be input as an input value of the first deep learning algorithm stored in advance in the first voice model deep learning unit to generate the first request text.
In step S16, the first request text may be input as an input value of the intention classification algorithm stored in advance in the utterance intention classification unit to determine the utterance intention and generate the utterance intention data.
Specifically, the utterance intention classification unit 220 may input the first request text as an input value of the intention classification algorithm stored in advance to calculate the user's utterance intention as second probability values, and may determine the utterance intention corresponding to the highest probability value among the calculated second probability values as the user's utterance intention.
In step S17, the utterance intention data may be input as an input value of the intention prediction algorithm stored in advance in the utterance intention prediction unit to determine the predicted utterance and generate the predicted utterance data.
Specifically, the utterance intention prediction unit 230 may input the utterance intention data as an input value of the intention prediction algorithm stored in advance to calculate the user's predicted utterance as third probability values. At this time, the utterance intention prediction unit 230 may determine the predicted utterance corresponding to the highest probability value among the calculated third probability values as the user's predicted utterance.
In step S18, the utterance intention data and the predicted utterance data may be input as input values of the response algorithm stored in advance in the response text generation unit to generate a response.
Specifically, the response text generation unit 300 may generate a response to the user's requested utterance by inputting the utterance intention data and the predicted utterance data as input values of the response algorithm stored in advance.
FIG. 7 is a diagram illustrating a method of adjusting the first deep learning algorithm according to an embodiment of the present invention.
In step S20, the request text output unit may provide the first converted text and the first request text to the first error rate calculation unit.
Specifically, the first converted text, which is the input value of the first deep learning algorithm learned and stored in advance in the first voice model deep learning unit 212, and the first request text, which is its output value, may be provided to the request text output unit 213. At this time, the request text output unit 213 may provide the first converted text and the first request text to the first error rate calculation unit 2120.
In step S21, the first error rate calculation unit 2120 may calculate the first error rate (or first loss value) between the first converted text and the first request text.
In step S22, the reference text storage unit 214 may provide the first reference text (or first transcription text) to the second voice model deep learning unit 215.
In step S23, the first reference text (or first transcription text) may be input as an input value of the second deep learning algorithm stored in advance in the second voice model deep learning unit 215, and an output value may be produced.
In step S24, the second error rate calculation unit 2150 may calculate the second error rate between the first reference text (or first transcription text) and the second reference text (or second transcription text).
Specifically, the second voice model deep learning unit 215 may fine-tune the second deep learning algorithm and perform deep learning so that it outputs a second reference text (or second transcription text) identical to the first reference text (or first transcription text).
Specifically, the second error rate calculation unit 2150 may calculate the second error rate (or second loss value) between the first reference text (or first transcription text) provided from the reference text storage unit 214, which serves as the input value of the second deep learning algorithm stored in advance, and the corresponding output value.
The second voice model deep learning unit 215 may use the second error rate (or second loss value) to fine-tune the second deep learning algorithm until its output value becomes a second reference text (or second transcription text) identical to the first reference text (or first transcription text).
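A minimal sketch of this fine-tune-until-identical loop. The toy model and its update rule are assumptions made purely for illustration; the patent states only that tuning continues until the output reproduces the reference:

```python
# Minimal sketch of the fine-tuning loop in steps S23-S24: the second deep learning
# algorithm is tuned until its output reproduces the first reference text. The toy
# model and its update rule are assumptions made for illustration only.

class ToyTextModel:
    """Stands in for the second deep learning algorithm."""
    def __init__(self):
        self.fidelity = 0.0          # 0.0 = garbles everything, 1.0 = perfect copy

    def run(self, text: str) -> str:
        keep = int(len(text) * self.fidelity)
        return text[:keep] + "#" * (len(text) - keep)  # garble the uncopied tail

    def update(self, error: float):
        self.fidelity = min(1.0, self.fidelity + 0.5 * error)  # assumed update rule

def error_rate(a: str, b: str) -> float:
    return sum(x != y for x, y in zip(a, b)) / max(len(a), 1)

model, reference = ToyTextModel(), "air tell me the weather today"
for _ in range(100):                       # fine-tune until output matches the input
    second_error = error_rate(reference, model.run(reference))
    if second_error == 0.0:
        break
    model.update(second_error)
print(model.run(reference) == reference)   # True once tuning has converged
```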
In step S25, the weight value calculation unit 216 may calculate the weight value using the first error rate (or first loss value) and the second error rate (or second loss value).
Specifically, the weight value calculation unit 216 may calculate the weight value for fine-tuning the first voice model deep learning unit 212 based on the first error rate value (or first loss value) and the second error rate value (or second loss value).
In step S26, the weight value calculation unit may provide the weight value to the first voice model deep learning unit.
In step S27, the first voice model deep learning unit may fine-tune the first deep learning algorithm stored in advance based on the weight value.
Specifically, the weight value provided by the weight value calculation unit 216 may be used as the weight value of the first voice model deep learning unit 212 to adjust the first deep learning algorithm, and the first converted text may be input as an input value of the fine-tuned first deep learning algorithm so that the first voice model deep learning unit 212 performs deep learning.
The drawings referred to so far and the detailed description of the invention are merely illustrative of the present invention; they are used only for the purpose of explaining the present invention and are not intended to limit its meaning or the scope of the present invention set forth in the claims. Therefore, those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible therefrom. Accordingly, the true technical protection scope of the present invention should be determined by the technical spirit of the appended claims.
The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor (DSP), a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions.
A processing device may run an operating system and one or more software applications executed on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For convenience of understanding, a single processing device is sometimes described as being used, but those of ordinary skill in the art will understand that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements.
For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible. Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct a processing device independently or collectively.
Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.
The method according to the embodiments may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software.
Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.
Although the embodiments have been described with reference to a limited set of examples and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are combined or assembled in a form different from the described method, or are replaced or substituted by other components or equivalents. Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims (16)

  1. An artificial intelligence-based voice processing system comprising at least one processor,
    wherein the at least one processor comprises:
    a voice determination unit configured to determine a gender and an age of a user based on a requested utterance of the user from one time point to another time point;
    a voice processing unit configured to convert the requested utterance of the user into text, determine an utterance intention of the user, and determine a predicted utterance of the user after the other time point; and
    a response generator configured to generate a response to the requested utterance of the user based on the utterance intention of the user and the predicted utterance.
  2. The artificial intelligence-based voice processing system of claim 1, wherein the voice determination unit comprises:
    a voice extraction unit configured to extract first voice data based on the requested utterance at the one time point;
    a first deep learning unit configured to calculate the gender and the age of the user as first probability values by using the first voice data as an input value of a first determination algorithm stored in advance, and to determine the gender and the age of the user based on the first probability values; and
    a voice recognizer distribution unit configured to recognize the requested utterance of the user from the one time point to the other time point based on the determined gender and age, and to extract second voice data.
  3. The artificial intelligence-based voice processing system of claim 2, wherein the first deep learning unit determines a gender and an age corresponding to a highest probability value among the first probability values as the gender and the age of the user.
  4. The artificial intelligence-based voice processing system of claim 2, wherein the voice recognizer distribution unit comprises a plurality of different voice recognizers configured to recognize the requested utterance of the user and to generate the second voice data in correspondence with the determined gender and age.
  5. The artificial intelligence-based voice processing system of claim 2, wherein the voice processing unit comprises:
    a voice-to-text conversion unit configured to convert the second voice data into a first converted text;
    a second deep learning unit configured to input the first converted text as an input value of a first deep learning algorithm stored in advance and to determine a first request text corresponding to the requested utterance of the user;
    an utterance intention classification unit configured to input the first request text as an input value of an intention classification algorithm stored in advance and to generate utterance intention data for the utterance intention; and
    an utterance intention prediction unit configured to input the utterance intention data as an input value of an intention prediction algorithm stored in advance and to generate predicted utterance data for the predicted utterance.
  6. The artificial intelligence-based voice processing system of claim 5, wherein the second deep learning unit comprises:
    a converted text input unit configured to receive the first converted text;
    a first voice model deep learning unit configured to generate the first request text by inputting the first converted text as an input value of the first deep learning algorithm stored in advance; and
    a request text output unit configured to output the first request text.
  7. The artificial intelligence-based voice processing system of claim 6, wherein the second deep learning unit further comprises:
    a reference text storage unit configured to store a first reference text corresponding to the requested utterance of the user; and
    a first error rate calculation unit configured to calculate a first error rate value between the first converted text and the first request text.
  8. The artificial intelligence-based voice processing system of claim 7, wherein the second deep learning unit further comprises:
    a second voice model deep learning unit configured to perform deep learning by inputting the first reference text as an input value of a second deep learning algorithm stored in advance;
    a second error rate calculation unit configured to calculate a second error rate value between the first reference text and a second reference text that is an output value of the second voice model deep learning unit; and
    a weight value calculation unit configured to calculate a weight value for deep learning of the first voice model deep learning unit based on the first error rate value and the second error rate value.
  9. The artificial intelligence-based voice processing system of claim 8, wherein the first voice model deep learning unit performs deep learning by using the weight value as a weight value of the first deep learning algorithm stored in advance and inputting the first converted text as an input value.
  10. The artificial intelligence-based voice processing system of claim 9, wherein the utterance intention classification unit generates the utterance intention data by inputting the first request text as an input value of the intention classification algorithm stored in advance and calculating the utterance intention as a second probability value.
  11. The artificial intelligence-based voice processing system of claim 10, wherein the utterance intention classification unit determines an utterance intention having a highest probability value among the second probability values as the utterance intention of the user.
  12. The artificial intelligence-based voice processing system of claim 10, wherein the utterance intention prediction unit generates the predicted utterance data by inputting the utterance intention data as an input value of the intention prediction algorithm stored in advance and calculating the predicted utterance as a third probability value.
  13. The artificial intelligence-based voice processing system of claim 12, wherein the utterance intention prediction unit determines a predicted utterance having a highest probability value among the third probability values as the predicted utterance of the user.
  14. The artificial intelligence-based voice processing system of claim 6, wherein the response generator comprises:
    a response text generation unit configured to generate a response text for the requested utterance of the user by using the utterance intention data and the predicted utterance data as input values of a response algorithm stored in advance; and
    a text-to-speech conversion unit configured to convert the response text into voice data.
  15. An artificial intelligence-based voice processing method performed by an artificial intelligence-based voice processing system comprising at least one processor, the at least one processor comprising a voice determination unit, a voice processing unit, and a response generator, the method comprising:
    determining, by the voice determination unit, a gender and an age of a user based on a requested utterance of the user from one time point to another time point;
    converting, by the voice processing unit, the requested utterance of the user into text to determine an utterance intention of the user and to determine a predicted utterance of the user after the other time point; and
    generating, by the response generator, a response to the requested utterance of the user based on the utterance intention of the user and the predicted utterance.
  16. A non-transitory computer-readable recording medium on which a program for executing the artificial intelligence-based voice processing system of claim 1 is recorded.
PCT/KR2022/021461 2021-12-28 2022-12-28 Artificial intelligence-based dialogue situation prediction and intention classification system, and method thereof WO2023128586A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0190154 2021-12-28
KR20210190154 2021-12-28
KR1020220041966A KR20230100543A (en) 2021-12-28 2022-04-05 System and method for conversational situation prediction and intention classification based on artificial intelligence
KR10-2022-0041966 2022-04-05

Publications (1)

Publication Number Publication Date
WO2023128586A1 true WO2023128586A1 (en) 2023-07-06

Family

ID=86999564

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/021461 WO2023128586A1 (en) 2021-12-28 2022-12-28 Artificial intelligence-based dialogue situation prediction and intention classification system, and method thereof

Country Status (1)

Country Link
WO (1) WO2023128586A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100119250A * 2009-04-30 2010-11-09 Samsung Electronics Co., Ltd. Apparatus for detecting voice using motion information and method thereof
JP2011108055A (en) * 2009-11-19 2011-06-02 Nippon Telegr & Teleph Corp <Ntt> Interactive system, interactive method, and interactive program
US20180174580A1 (en) * 2016-12-19 2018-06-21 Samsung Electronics Co., Ltd. Speech recognition method and apparatus
JP2018169494A * 2017-03-30 2018-11-01 Toyota Motor Corp. Utterance intention estimation device and utterance intention estimation method
KR20200013152A * 2018-07-18 2020-02-06 Samsung Electronics Co., Ltd. Electronic device and method for providing artificial intelligence services based on pre-gathered conversations

Similar Documents

Publication Publication Date Title
WO2020060325A1 (en) Electronic device, system, and method for using voice recognition service
WO2020230926A1 (en) Voice synthesis apparatus for evaluating quality of synthesized voice by using artificial intelligence, and operating method therefor
WO2020085794A1 (en) Electronic device and method for controlling the same
WO2020040595A1 (en) Electronic device for processing user utterance, and control method therefor
WO2020226213A1 (en) Artificial intelligence device for providing voice recognition function and method for operating artificial intelligence device
WO2019151802A1 (en) Method of processing a speech signal for speaker recognition and electronic apparatus implementing same
WO2022035183A1 Device for recognizing user's voice input and method for operating same
WO2020096218A1 (en) Electronic device and operation method thereof
EP3841460A1 (en) Electronic device and method for controlling the same
WO2014163231A1 (en) Speech signal extraction method and speech signal extraction apparatus to be used for speech recognition in environment in which multiple sound sources are outputted
WO2018056779A1 (en) Method of translating speech signal and electronic device employing the same
WO2020138662A1 (en) Electronic device and control method therefor
WO2021246812A1 (en) News positivity level analysis solution and device using deep learning nlp model
WO2021040490A1 (en) Speech synthesis method and apparatus
WO2023128586A1 (en) Artificial intelligence-based dialogue situation prediction and intention classification system, and method thereof
WO2023113502A1 (en) Electronic device and method for recommending speech command therefor
WO2023085584A1 (en) Speech synthesis device and speech synthesis method
WO2022177224A1 (en) Electronic device and operating method of electronic device
WO2022055107A1 (en) Electronic device for voice recognition, and control method therefor
WO2022149693A1 (en) Electronic device and method for processing user utterance in the electronic device
WO2022114451A1 (en) Artificial neural network training method, and pronunciation evaluation method using same
WO2023048359A1 (en) Speech recognition device and operation method therefor
WO2022186435A1 Electronic device for correcting user's voice input and method for operating same
WO2023106649A1 (en) Electronic device for performing voice recognition by using recommended command
WO2024076214A1 (en) Electronic device for performing voice recognition, and operating method therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22916727

Country of ref document: EP

Kind code of ref document: A1