CN113808575A - Voice interaction method, system, storage medium and electronic equipment - Google Patents

Voice interaction method, system, storage medium and electronic equipment

Info

Publication number
CN113808575A
Authority
CN
China
Prior art keywords
information
voice
semantic
speaker
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010546399.9A
Other languages
Chinese (zh)
Inventor
杨昌品
宋德超
黄姿荣
贾巨涛
韩林峄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Zhuhai Lianyun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai, Zhuhai Lianyun Technology Co Ltd filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN202010546399.9A priority Critical
Publication of CN113808575A publication Critical
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 - Adaptation
    • G10L 15/07 - Adaptation to the speaker
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L 17/06 - Decision making techniques; Pattern matching strategies
    • G10L 17/14 - Use of phonemic categorisation or speech recognition prior to speaker recognition or verification

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice interaction method, system, storage medium and electronic device, relating to the technical field of voice interaction. The method comprises the following steps: acquiring voice information; determining characteristic information of the speaker who uttered the voice information; determining, according to the characteristic information, the group category to which the speaker belongs; obtaining a corpus matched with the group category; obtaining the semantic intent matched with the voice information from the corpus; and controlling a smart device to perform an action responsive to the semantic intent. The beneficial effect of the invention is that the semantic intent the voice information is meant to express is accurately identified using the corresponding corpus, thereby improving the accuracy of semantic intent recognition.

Description

Voice interaction method, system, storage medium and electronic equipment
Technical Field
The invention belongs to the technical field of voice interaction, and particularly relates to a voice interaction method, a voice interaction system, a storage medium and electronic equipment.
Background
In the voice interaction process, the user dialogue marks the start and end of the loop: the speech the user utters at the client is converted into text by ASR (automatic speech recognition) and fed into the dialogue system; after semantic understanding and dialogue decision are performed in the dialogue system, the specified service is invoked and the corresponding text content is output, which is then converted into speech by TTS (text-to-speech) and returned to the user at the client. At present, common semantic understanding models are trained on generic phrasing. However, users from different regions, of different ages and with different characters express themselves in different ways and styles, and the sentence structures they use to express a semantic intent differ, so most semantic understanding is not accurate enough and cannot correctly capture the user's intent.
Disclosure of Invention
To address the technical problem that existing semantic understanding technology cannot accurately understand the intents of different users, the invention provides a voice interaction method, system, storage medium and electronic device.
In a first aspect, an embodiment of the present invention provides a voice interaction method, including:
acquiring voice information;
determining characteristic information of a speaker who utters the voice information; wherein the characteristic information can be used for representing the group category to which the speaker belongs;
determining the group category to which the speaker who sends the voice information belongs according to the characteristic information;
obtaining a corpus matched with the group category;
obtaining semantic intentions matched with the voice information from the corpus;
controlling the smart device to perform an action responsive to the semantic intent.
Optionally, the corpus is established in advance by:
acquiring historical voice dialogue data of speakers belonging to the same group category, wherein the historical voice dialogue data comprises historical voice information and semantic intentions expressed by the historical voice information;
and performing statistical analysis on the historical voice dialogue data, determining from it the common linguistic features of the historical voice information used when speakers belonging to the group category express the same semantic intention, and establishing an association relationship between the common linguistic features and the semantic intentions corresponding to them, so as to construct the corpus.
Optionally, the common linguistic feature includes at least one of a multi-frequency word, a keyword, a language sentence pattern, and a mood word.
Optionally, performing statistical analysis on the historical voice dialogue data, determining from it the common linguistic features of the historical voice information used when speakers belonging to the group category express the same semantic intention, and establishing an association relationship between the common linguistic features and the semantic intentions corresponding to them, so as to construct the corpus, includes:
when the common language features comprise multi-frequency words, determining the multi-frequency words in the historical voice information and semantic intentions expressed by the multi-frequency words, and establishing an association relationship between the multi-frequency words and the semantic intentions corresponding to the multi-frequency words so as to construct the corpus; the multi-frequency words are words with the occurrence frequency exceeding a preset threshold value;
when the common language features comprise keywords, selecting historical voice information expressing the same semantic intention from the historical voice dialogue data;
determining a keyword capable of expressing the semantic intention from the selected historical voice information, and establishing an association relationship between the keyword and the semantic intention corresponding to the keyword so as to construct the corpus;
when the common linguistic features comprise language sentence patterns, counting how many times each sentence pattern corresponding to the historical voice information expressing the same semantic intention occurs in the historical voice dialogue data, and associating the most frequently used sentence pattern with the semantic intention it corresponds to, so as to construct the corpus;
and when the common linguistic features comprise mood words, counting how many times each mood word is used in the historical voice information expressing the same semantic intention in the historical voice dialogue data, and establishing an association relationship between the most frequently used mood word and the semantic intention it corresponds to, so as to construct the corpus.
Optionally, the feature information includes at least one of age information, gender information, character information, and region information.
Optionally, determining feature information of a speaker who uttered the voice message includes:
extracting voiceprint features from the voice information, and determining identity information of a speaker who sends the voice information based on the voiceprint features;
and determining the characteristic information of the speaker according to the identity information of the speaker.
Optionally, determining feature information of a speaker who uttered the voice message includes:
when the feature information comprises age information and/or gender information, extracting sound spectrum features from the voice information, and determining the age information and/or gender information of a speaker who sends the voice information according to the sound spectrum features;
when the feature information comprises character information, determining the language expression style of the voice information, and determining the character type matched with the language expression style as character information of a speaker who sends the voice information according to the language expression style;
when the feature information comprises region information, extracting voice features from the voice information, and determining the region information of the speaker who uttered the voice information according to the voice features; wherein the voice features comprise at least one of accent, pronunciation and intonation.
In a second aspect, an embodiment of the present invention further provides a voice interaction system, including:
the voice acquisition module is used for acquiring voice information;
the characteristic determining module is used for determining the characteristic information of a speaker who sends the voice information; wherein the characteristic information can be used for representing the group category to which the speaker belongs;
the group category determining module is used for determining the group category to which the speaker who sends the voice information belongs according to the characteristic information;
a corpus acquisition module, configured to acquire a corpus matched with the group category;
the semantic intention determining module is used for acquiring semantic intentions matched with the voice information from the corpus;
and the control module is used for controlling the intelligent equipment to execute the action responding to the semantic intention.
In a third aspect, an embodiment of the present invention further provides a storage medium, where the storage medium stores program codes, and when the program codes are executed by a processor, the voice interaction method is implemented as in any one of the foregoing embodiments.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores program codes executable on the processor, and when the program codes are executed by the processor, the electronic device implements the voice interaction method as described in any one of the above embodiments.
In a fifth aspect, an embodiment of the present invention further provides a voice interaction system, including:
the client is used for acquiring voice information; and
a server comprising a memory, a processor, the memory having stored thereon program code executable on the processor, the program code implementing the voice interaction method as in any one of the above embodiments when executed by the processor.
According to the voice interaction method provided by the embodiment of the invention, the characteristic information of the speaker who sends the voice information is determined, the group category of the speaker is determined according to the characteristic information, and the corpus matched with the group category is further obtained, so that the semantic intention corresponding to the voice information is determined by using the corpus, and the intelligent device is controlled to execute the response action corresponding to the semantic intention. Therefore, the voice interaction method provided by the embodiment of the invention can match the corresponding corpus according to the group categories to which different speakers belong, so that the semantic intention to be expressed by the voice information is accurately identified by using the corresponding corpus, and the accurate identification of the semantic intention is realized.
Drawings
The scope of the present disclosure may be better understood by reading the following detailed description of exemplary embodiments in conjunction with the accompanying drawings. Wherein the included drawings are:
fig. 1 is a flow chart illustrating a voice interaction method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating a process of determining feature information of a speaker according to a second embodiment of the present invention;
fig. 3 is a schematic flowchart illustrating a process of constructing a corpus according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the implementation of the present invention is described in detail below with reference to the accompanying drawings and embodiments, so that how technical means are applied to solve the technical problems and achieve the technical effects can be fully understood and implemented.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Example one
According to an embodiment of the present invention, a voice interaction method is provided, and fig. 1 shows a flowchart of a voice interaction method according to an embodiment of the present invention, as shown in fig. 1, the voice interaction method may include: step 110 to step 160.
In step 110, voice information is obtained.
Here, the voice information refers to the voice a user utters when interacting with the smart device. For example, if the user interacts with an air conditioner and says "help me check tomorrow's weather", that utterance is taken as the voice information. The smart device may be an air conditioner, refrigerator, television, range hood or other smart device with a voice function.
In step 120, determining characteristic information of a speaker who utters the voice message; wherein the feature information can be used for characterizing the group category to which the speaker belongs.
Here, the feature information refers to attributes of the speaker that describe the user's personalized characteristics. For example, the feature information includes at least one of age, gender, character and region, and may also include occupation, ethnicity or other feature information that reflects user attributes. By age, speakers may be divided into those born in the 1960s or earlier, the 1970s, the 1980s, the 1990s, the 2000s, the 2010s, and so on; by gender, into male and female; by character, into lively and outgoing, quiet and reserved, brave and confident, diligent, independent, creative, and so on; by region, into the Northeast, Beijing-Tianjin, North China, the West, the Central Plains, the Southwest, Hunan-Hubei, Guangdong-Guangxi, Jiangnan, Fujian, and so on.
In step 130, according to the characteristic information, a group category to which a speaker who utters the voice information belongs is determined.
Here, the group category to which the speaker belongs refers to the user group the speaker falls into, and the feature information maps onto the group category. For example, if the determined feature information of the speaker includes an age of 40 and the Guangdong region, the group category to which the speaker belongs is 40-year-old Guangdong people; if the determined feature information only includes the Guangdong region, the group category is Guangdong people.
In step 140, a corpus matching the population class is obtained.
Here, different group categories correspond to different corpora. For example, when the group category of speaker A is Guangdong people, corpus A matched with that group category is acquired to semantically understand the voice information uttered by speaker A. As another example, if speaker A is a lively, outgoing Guangdong person, corpus B matched with the group category of lively, outgoing Guangdong people is acquired, and the voice information is semantically understood using corpus B, so that the semantic intent of speaker A is accurately recognized.
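As a purely illustrative sketch (not the claimed implementation), the mapping from feature information to a group category and then to a group-specific corpus can be viewed as two lookups; the feature keys, group names and corpus entries below are hypothetical placeholders.

```python
# Minimal sketch: feature information -> group category -> group-specific corpus.
# All feature values, group names and corpus entries are made-up placeholders.

def group_category(features: dict) -> str:
    """Join the available feature values into a group-category key."""
    order = ("region", "age_band", "character", "gender")
    parts = [features[k] for k in order if k in features]
    return "/".join(parts) if parts else "default"

# Hypothetical corpora: one text -> intent table per group category.
CORPORA = {
    "guangdong": {"so hot": "turn_on_air_conditioner"},
    "guangdong/40s": {"so hot": "turn_on_air_conditioner"},
    "sichuan": {"too stuffy": "turn_on_air_conditioner"},
    "default": {"turn on the air conditioner": "turn_on_air_conditioner"},
}

def corpus_for(features: dict) -> dict:
    key = group_category(features)
    # Fall back to the generic corpus when no group-specific one exists.
    return CORPORA.get(key, CORPORA["default"])

print(corpus_for({"region": "guangdong"}))   # {'so hot': 'turn_on_air_conditioner'}
```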
In step 150, semantic intents matching the speech information are obtained from the corpus.
Here, after converting the speech information into text information, the converted text information is used to perform matching in the corpus, thereby obtaining a semantic intention matching the speech information. The conversion of the speech information into the text information can be accomplished by the existing speech recognition technology (ASR), which is not described herein.
For example, if the group category of the speaker who utters the voice information "so hot" is Guangdong people, the text "so hot" is used for matching in the corpus associated with Guangdong people, and the matched semantic intent is "turn on the air conditioner".
In addition, the corpora corresponding to different group categories store text information and the semantic intents associated with it, but the text associated with the same semantic intent may differ across corpora. For example, Guangdong people are used to saying "so hot", and "so hot" is associated with the semantic intent "turn on the air conditioner"; Sichuan people prefer to say "it's too stuffy", and "it's too stuffy" is likewise associated with "turn on the air conditioner".
It should be noted that, although only corpora for regional group categories are described in detail here, corpora for other group categories are constructed by the same method.
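A minimal lookup over such a group-specific table might look as follows; the phrases and intents are illustrative only, and a real system would match the text recognized by ASR rather than a hard-coded string.

```python
# Sketch: match recognized text against a group-specific corpus.
# Phrases and intents are illustrative placeholders.

CANTONESE_CORPUS = {
    "so hot": "turn_on_air_conditioner",
    "check tomorrow's weather": "query_weather",
}

def match_intent(text: str, corpus: dict) -> str | None:
    # Exact match first, then a simple substring fallback.
    if text in corpus:
        return corpus[text]
    for phrase, intent in corpus.items():
        if phrase in text:
            return intent
    return None

print(match_intent("so hot today", CANTONESE_CORPUS))  # turn_on_air_conditioner
```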
In step 160, the smart device is controlled to perform an action responsive to the semantic intent.
Here, after the semantic intent of the speaker's voice information is determined, the smart device is controlled to perform the action responding to that semantic intent. For example, if the semantic intent is to turn on the air conditioner, the air conditioner is controlled to turn on.
It should be noted that controlling the smart device to perform the action responding to the semantic intent may be done by obtaining, from a preset database, response information matching the semantic intent; the database stores semantic intents and the response information associated with them. For example, the semantic intent "turn on the air conditioner" is associated with a control instruction for turning on the air conditioner. The database may be the corpus itself, that is, the corpus may store data in the form "text - semantic intent - response information", such as "too hot - turn on the air conditioner - control instruction for turning on the air conditioner".
In addition, the action responding to the semantic intent may be a control instruction realizing the semantic intent and/or a response voice. If the semantic intent is to turn on the air conditioner, the response action may be a control instruction controlling the air conditioner to turn on and/or a response voice fed back to the speaker, such as "the air conditioner has been turned on for you".
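A toy dispatch table in that spirit is sketched below; the command fields and reply wording are invented for illustration and are not part of the disclosed implementation.

```python
# Sketch: map a semantic intent to a control instruction and a spoken reply.
# Command names and reply texts are hypothetical.

RESPONSES = {
    "turn_on_air_conditioner": {
        "command": {"device": "air_conditioner", "action": "power_on"},
        "reply": "The air conditioner has been turned on for you.",
    },
    "query_weather": {
        "command": {"device": "assistant", "action": "fetch_weather"},
        "reply": "Here is tomorrow's weather.",
    },
}

def respond(intent: str) -> dict:
    entry = RESPONSES.get(intent)
    if entry is None:
        return {"command": None, "reply": "Sorry, I did not understand that."}
    return entry

print(respond("turn_on_air_conditioner")["reply"])
```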
In this embodiment, the group category to which each speaker belongs can be determined according to that speaker's feature information, the corpus corresponding to that group category is matched, and the semantic intent the voice information is meant to express is accurately identified using the corresponding corpus, so that accurate recognition of the semantic intent is achieved.
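Viewed end to end, steps 110 to 160 of this example chain together as in the following hedged sketch; every function body is a stub standing in for the real module, and all data is fabricated for illustration.

```python
# End-to-end sketch of steps 110-160 with stubbed components.
# Every function body here is a placeholder standing in for the real module.

def acquire_voice() -> bytes:                 # step 110
    return b"fake-audio"

def speaker_features(audio: bytes) -> dict:   # step 120
    return {"region": "guangdong", "age_band": "40s"}

def group_category(features: dict) -> str:    # step 130
    return "/".join(features.values())

def load_corpus(group: str) -> dict:          # step 140
    return {"so hot": "turn_on_air_conditioner"}

def recognize_intent(audio: bytes, corpus: dict) -> str | None:  # step 150
    text = "so hot"                            # stand-in for ASR output
    return corpus.get(text)

def execute(intent: str | None) -> str:       # step 160
    return f"executing: {intent}" if intent else "no matching intent"

audio = acquire_voice()
corpus = load_corpus(group_category(speaker_features(audio)))
print(execute(recognize_intent(audio, corpus)))
```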
Example two
On the basis of the above embodiment, a second embodiment of the present invention may further provide a voice interaction method. The voice interaction method can comprise the following steps: step 210 to step 260.
In step 210, voice information is obtained.
Here, the voice information refers to the voice a user utters when interacting with the smart device. For example, if the user interacts with an air conditioner and says "help me check tomorrow's weather", that utterance is taken as the voice information. The smart device may be an air conditioner, refrigerator, television, range hood or other smart device with a voice function.
In step 220, determining characteristic information of a speaker who utters the voice message; wherein the feature information can be used for characterizing the group category to which the speaker belongs.
The characteristic information includes at least one of age information, gender information, character information and region information. For example, by age, speakers may be divided into those born in the 1960s or earlier, the 1970s, the 1980s, the 1990s, the 2000s, the 2010s, and so on; by gender, into male and female; by character, into lively and outgoing, quiet and reserved, brave and confident, diligent, independent, creative, and so on; by region, into the Northeast, Beijing-Tianjin, North China, the West, the Central Plains, the Southwest, Hunan-Hubei, Guangdong-Guangxi, Jiangnan, Fujian, and so on.
In one embodiment, the step 220 of determining feature information of a speaker who uttered the voice message may include: step 221 to step 222.
In step 221, voiceprint features are extracted from the voice message, and identity information of a speaker who uttered the voice message is determined based on the voiceprint features.
Here, the voiceprint features may be extracted from the voice information through a VQ clustering module. The voiceprint features include characteristics such as frequency, pitch, nasality and breathiness; prosodic features of the user, such as speech rhythm, speech rate, intonation and stress, may also be extracted. After the voiceprint features are extracted, they are compared with voiceprints pre-stored in a database, thereby determining the identity information of the speaker who uttered the voice information.
In step 222, feature information of the speaker is determined according to the identity information of the speaker.
Here, after the identity information of the speaker is determined, the feature information of the speaker can be determined using that identity. The feature information is pre-registered by the speaker: the database stores "identity information - feature information" records, so the feature information of the corresponding speaker, i.e. the age, gender, character and region information previously entered by the user, can be looked up by identity.
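One hedged way to picture this two-stage lookup is sketched below: a voiceprint embedding is compared against enrolled ones by cosine similarity, and the matched identity indexes a pre-registered profile. The embeddings, threshold and profiles are invented for illustration and do not reflect the actual voiceprint comparison used.

```python
# Sketch: identify the speaker by comparing a voiceprint embedding against
# enrolled ones, then look up pre-registered feature information.
# Embeddings, thresholds and profiles are invented for illustration.
import numpy as np

ENROLLED = {                      # identity -> enrolled voiceprint embedding
    "user_001": np.array([0.9, 0.1, 0.3]),
    "user_002": np.array([0.2, 0.8, 0.5]),
}
PROFILES = {                      # identity -> pre-registered feature information
    "user_001": {"age_band": "40s", "region": "guangdong"},
    "user_002": {"age_band": "20s", "region": "sichuan"},
}

def identify(embedding: np.ndarray, threshold: float = 0.8) -> str | None:
    best_id, best_score = None, -1.0
    for identity, enrolled in ENROLLED.items():
        score = float(np.dot(embedding, enrolled) /
                      (np.linalg.norm(embedding) * np.linalg.norm(enrolled)))
        if score > best_score:
            best_id, best_score = identity, score
    return best_id if best_score >= threshold else None

query = np.array([0.88, 0.12, 0.28])
speaker = identify(query)
print(speaker, PROFILES.get(speaker))
```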
In another implementation, fig. 2 is a schematic flowchart of determining the feature information of a speaker according to the second embodiment of the present invention. As shown in fig. 2, the feature information of the speaker who uttered the voice information may be determined directly from the voice information itself.
The specific process of determining the age information and/or the gender information of the speaker may be: and extracting sound spectrum characteristics from the voice information, and determining age information and/or gender information of a speaker who sends the voice information according to the sound spectrum characteristics.
Here, the voice spectrum characteristics of speech uttered by users of different ages and genders differ, so the age information and/or gender information of the speaker who uttered the voice information can be determined from the spectral features. The specific process of identifying the age information and/or gender information is as follows: speech of males and females in different age groups is collected, spectral features are extracted from that speech, and a gender recognition model and/or an age recognition model is trained on those features; the trained model is then used to recognize the spectral features extracted from the received voice information, thereby determining the gender information and/or age information of the speaker.
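A minimal sketch of such a recognition model, assuming the spectral features have already been extracted as fixed-length vectors, is shown below; the random features and labels are toy stand-ins, not the training data or classifier the patent describes.

```python
# Sketch: train a gender classifier on precomputed spectral feature vectors.
# The features and labels are random stand-ins; a real system would extract
# spectral features from labelled recordings of male and female speakers of
# different ages.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 13))          # 200 utterances x 13 features
y_train = rng.integers(0, 2, size=200)        # 0 = female, 1 = male (toy labels)

gender_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

new_utterance = rng.normal(size=(1, 13))      # spectral features of new speech
print("predicted gender label:", int(gender_model.predict(new_utterance)[0]))
```

An age recognition model could be trained in the same way, with age-band labels in place of gender labels.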
In addition, the specific process of determining the personality information of the speaker may be:
and determining the language expression style of the voice information, and determining the character type matched with the language expression style as character information of a speaker who sends the voice information according to the language expression style.
Here, users with different characters have different speaking styles; for example, a lively and outgoing speaker tends to speak humorously, while a shy, reserved speaker tends to speak politely. Determining the language expression style of the voice information may be done by converting the voice information into text and analysing the sentence patterns and modal words used in the text, thereby determining the expression style of the speaker who uttered the voice information. Determining the character type matched with the language expression style as the character information of the speaker may be done by collecting, in advance, the expression styles of users with different characters to form "language expression style - character type" data; after the speaker's expression style is determined, it is used to look up the matching character type, thereby obtaining the character information of the speaker.
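A rough illustration of such a style-to-character lookup, using hand-picked cue words as a stand-in for real style analysis, could look like this; the cue lists and character labels are hypothetical.

```python
# Sketch: infer a character type from the wording of the recognized text by
# scoring style cues. The cue words and character labels are illustrative only.

STYLE_CUES = {
    "lively": {"haha", "awesome", "cool"},
    "polite": {"please", "thank you", "would you"},
}

def character_type(text: str) -> str:
    text = text.lower()
    scores = {label: sum(cue in text for cue in cues)
              for label, cues in STYLE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

print(character_type("could you please turn on the fan, thank you"))  # polite
```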
In addition, the specific process of determining the geographical information of the speaker may be:
extracting voice features from the voice information, and determining the region information of the speaker who uttered the voice information according to the voice features; wherein the voice features comprise at least one of accent, pronunciation and intonation.
Here, even when people from different regions speak the same sentence, there are differences in accent, speech rate, intonation and pronunciation; for example, people from Hunan often confuse the pronunciations of n and l. Speech from different regions can be collected, its voice features extracted and stored in a region recognition database for recognizing speech from different regions. After the voice information is received, its voice features are extracted and then recognized against the region recognition database, thereby determining the region information of the speaker who uttered the voice information.
In this embodiment, by determining feature information such as age, gender, character and region information of the speaker from the voice information itself, the feature information of the speaker can be accurately recognized even when the speaker has not entered the corresponding feature information, thereby providing an accurate data basis for subsequent semantic intent recognition.
In step 230, according to the feature information, a group category to which a speaker who utters the voice information belongs is determined.
Here, the group category to which the speaker belongs refers to the user group the speaker falls into, and the feature information maps onto the group category. For example, if the determined feature information of the speaker includes an age of 40 and the Guangdong region, the group category to which the speaker belongs is 40-year-old Guangdong people; if the determined feature information only includes the Guangdong region, the group category is Guangdong people.
In step 240, a corpus matching the population class is obtained.
In one embodiment, a method for constructing a corpus is provided, and the method for constructing the corpus may include: step 2401 to step 2402.
In step 2401, historical voice dialogue data of speakers belonging to the same group category is obtained, wherein the historical voice dialogue data comprises historical voice information and semantic intentions expressed by the historical voice information.
Here, speakers belonging to the same group category refer to a user group sharing the same feature information, such as a user group with the same character, the same age band, the same region, or the same gender. The historical voice dialogue data may include both historical voice dialogues between users and historical voice dialogues between the user and the smart device, and it comprises the historical voice information and the semantic intent associated with it. The semantic intent associated with a piece of historical voice information refers to the intent that was finally realized for that utterance; for example, when the user says "I am too hot", the final semantic intent is "turn on the air conditioner". When acquiring the historical dialogue data, not only the user's historical voice information but also the semantic intent that voice information was meant to express needs to be acquired.
It is worth noting that the voice information from each interaction between the user and the smart device can be used as historical dialogue data for training the corpus. For example, a user says "I am too hot" and, according to this voice information, the smart device should perform the response action of turning on the air conditioner; but in the subsequent interaction the user, following the actual semantic intent, does not turn on the air conditioner but turns on a fan. The corpus entries are then corrected using the historical dialogue record "I am too hot - turn on the fan", so that the user's personalized requirements are met and the semantic intent is accurately understood.
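As a hedged sketch of that correction step, the following snippet overwrites a corpus entry whenever the action the user actually took contradicts the intent the system predicted; the utterance, intents and record format are illustrative.

```python
# Sketch: correct a corpus entry when the action the user actually took
# contradicts the intent the system predicted. Data is illustrative.

corpus = {"i am too hot": "turn_on_air_conditioner"}

history = [
    # (utterance, intent predicted by the system, action the user finally took)
    ("i am too hot", "turn_on_air_conditioner", "turn_on_fan"),
]

for utterance, predicted, actual in history:
    if predicted != actual:
        # The user's behaviour is taken as the ground-truth intent.
        corpus[utterance] = actual

print(corpus)   # {'i am too hot': 'turn_on_fan'}
```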
In step 2402, by performing statistical analysis on the historical speech dialogue data, common linguistic features of the speech information when the speakers belonging to the group category express the same semantic intention are determined from the historical speech dialogue data, and an association relationship is established between the common linguistic features and the semantic intention corresponding to the common linguistic features, so that the corpus is constructed.
Here, by statistically analyzing common linguistic features in speech information used when users belonging to the same group category express the same semantic intention, it is possible to find commonalities in which users of the same group category express the same intention. And establishing an incidence relation between the common language features and the semantic intents corresponding to the common language features, thereby generating corpus data and obtaining a corpus.
The common linguistic features include at least one of multi-frequency words, keywords, language sentence patterns and mood words.
Fig. 3 is a schematic flowchart of constructing a corpus according to a second embodiment of the present invention, and as shown in fig. 3, in an embodiment, when the common language feature includes multiple frequency words, step 2402 may include:
determining multi-frequency words in historical voice information and semantic intentions expressed by the multi-frequency words, and establishing an association relationship between the multi-frequency words and the semantic intentions corresponding to the multi-frequency words so as to construct a corpus; the multi-frequency words are words with the occurrence frequency exceeding a preset threshold value.
Here, multi-frequency words are words the user frequently says in daily life; by counting the multi-frequency words in the historical voice information and associating them with the semantic intent they express, the user's real intent can be known from those words. For example, if the multi-frequency word that users in the same group category habitually say is "hot", and the semantic intent corresponding to it is "turn on the air conditioner", then "hot" and "turn on the air conditioner" are associated to obtain the corpus entry "hot - turn on the air conditioner".
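A toy version of this counting step, assuming a handful of historical utterances for one intent and an arbitrary threshold, might read as follows; the utterances, threshold and intent label are made up.

```python
# Sketch: find multi-frequency words (occurrence count above a threshold) in a
# group's historical utterances and associate them with the expressed intent.
from collections import Counter

history = [
    ("so hot in here", "turn_on_air_conditioner"),
    ("it is really hot", "turn_on_air_conditioner"),
    ("hot again today", "turn_on_air_conditioner"),
]
THRESHOLD = 3
intent = "turn_on_air_conditioner"            # build the table one intent at a time

counts = Counter(word
                 for text, label in history if label == intent
                 for word in text.split())
corpus = {word: intent for word, n in counts.items() if n >= THRESHOLD}
print(corpus)   # {'hot': 'turn_on_air_conditioner'}
```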
In one embodiment, when the common linguistic feature includes a keyword, step 2402 may include:
selecting historical voice information expressing the same semantic intention from the historical voice dialogue data;
and determining a keyword capable of expressing the semantic intention from the selected historical voice information, and establishing an association relationship between the keyword and the semantic intention corresponding to the keyword so as to construct the corpus.
Here, the purpose of selecting historical voice information expressing the same semantic intent from the historical voice dialogue data is to gather the different historical utterances used when users of the same group category express that intent. For example, the historical voice information used by users with the same character to express the semantic intent "turn on the air conditioner" includes "I am too hot" and "so hot". The keyword expressing the intent "turn on the air conditioner" is determined from "I am too hot" and "so hot" to be "hot", and the intent is associated with the keyword to obtain the corpus entry "hot - turn on the air conditioner". When a user with the same character mentions "hot" in a later voice interaction, it is known that the air conditioner should be turned on, so that accurate recognition of the semantic intent is achieved.
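One simple, hedged reading of keyword selection is "a word shared by every utterance with that intent", sketched below with invented utterances; a real system might weigh keywords differently.

```python
# Sketch: extract a keyword for one intent as a word shared by every historical
# utterance that expressed that intent. Utterances and the intent are made up.

utterances_for_intent = ["i am too hot", "so hot in this room"]

shared = set(utterances_for_intent[0].split())
for text in utterances_for_intent[1:]:
    shared &= set(text.split())

corpus = {word: "turn_on_air_conditioner" for word in shared}
print(corpus)   # {'hot': 'turn_on_air_conditioner'}
```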
In one embodiment, when the common linguistic feature comprises a linguistic pattern, step 2402 may comprise:
and counting the times of the sentence patterns corresponding to the historical voice information for expressing the same semantic intention in the historical voice dialogue data, and associating the sentence pattern with the largest number of times with the semantic intention corresponding to the sentence pattern, thereby constructing the corpus.
Here, the linguistic pattern may be just the most used historical speech information among the individual historical speech information used to express the same semantic intent. For example, the historical speech dialogue data of the Guangdong is counted, and if the historical speech information with the most used semantic intention of the Guangdong expressing "open air conditioner" is determined to be "good heat", the "good heat" is used as a language sentence pattern, and the language sentence pattern and the semantic intention corresponding to the language sentence pattern are associated, so that the linguistic data of "good heat" and open air conditioner "is obtained. After the corpus is constructed, when the voice message 'good heat a' is sent out by the speaker belonging to the group category of the Guangdong people, the semantic intention of 'opening an air conditioner' can be found from the corpus of the Guangdong people according to the text message of the 'good heat a'.
In one embodiment, when the common linguistic feature comprises a mood word, step 2402 may comprise:
counting how many times each mood word is used in the historical voice information expressing the same semantic intention in the historical voice dialogue data, and establishing an association relationship between the most frequently used mood word and the semantic intention it corresponds to, thereby constructing the corpus.
Here, the mood word may be a modal particle the user habitually uses in daily speech. By counting the mood words used when users of the same group category express the same semantic intent, that intent can be recognized when a user utters the corresponding mood word.
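Both of the counting steps above ("most frequent sentence pattern" and "most frequent mood word" for one intent) reduce to picking the mode of a counter, as in this illustrative sketch; the utterances and the particle "ah" are fabricated examples.

```python
# Sketch: for one intent, keep the most frequently used full utterance as the
# language sentence pattern and the most frequent sentence-final particle as
# the mood word. All data is illustrative.
from collections import Counter

utterances_for_intent = ["so hot ah", "so hot ah", "it is boiling ah", "so hot la"]

sentence_pattern = Counter(utterances_for_intent).most_common(1)[0][0]
mood_word = Counter(t.split()[-1] for t in utterances_for_intent).most_common(1)[0][0]

corpus = {
    sentence_pattern: "turn_on_air_conditioner",
    mood_word: "turn_on_air_conditioner",
}
print(sentence_pattern, "|", mood_word)   # so hot ah | ah
```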
In the above embodiment, by collecting and sorting the historical speech dialogue data of users belonging to the same group category and performing dialogue learning on the historical speech dialogue data, it is possible to obtain common linguistic features of languages used when the same group category expresses the same semantic intention, thereby obtaining a corpus. The maintained corpus is directly used for a human-computer voice interaction semantic understanding process so as to improve the semantic understanding accuracy of the voice interaction of the user.
In step 250, semantic intent matching the speech information is obtained from the corpus.
Here, after the voice information is converted into text information, the converted text is used for matching in the corpus, thereby obtaining the semantic intent matching the voice information. After the corpus matched with the feature information is obtained, the text corresponding to the voice information is matched in that corpus, and the semantic intent the voice information is meant to express is determined according to the matching result. For example, if the speaker who utters the voice information "so hot" is a Guangdong person, the text "so hot" is used for matching in the corpus associated with Guangdong people, and the matched semantic intent is "turn on the air conditioner".
When several semantic intents are matched, the semantic intent corresponding to the corpus entry with the highest similarity is taken as the intent the speaker meant to express.
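As a minimal sketch of that tie-breaking step, the snippet below ranks corpus entries by a generic string-similarity ratio and keeps the best one; difflib is used only as a stand-in for whatever similarity measure a real system applies, and the corpus contents are invented.

```python
# Sketch: when several corpus entries resemble the recognized text, keep the
# intent of the most similar entry. Data is illustrative.
from difflib import SequenceMatcher

corpus = {
    "so hot": "turn_on_air_conditioner",
    "so cold": "turn_on_heater",
}

def best_intent(text: str) -> str:
    best_phrase = max(corpus, key=lambda p: SequenceMatcher(None, text, p).ratio())
    return corpus[best_phrase]

print(best_intent("it is so hot today"))   # turn_on_air_conditioner
```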
In step 260, the smart device is controlled to perform an action responsive to the semantic intent.
Here, controlling the smart device to perform the action responding to the semantic intent may be done by obtaining, from a preset database, response information matching the semantic intent; the database stores semantic intents and the response information associated with them. For example, the semantic intent "turn on the air conditioner" is associated with a control instruction for turning on the air conditioner. The database may be the corpus itself, that is, the corpus may store data in the form "text - semantic intent - response information", such as "too hot - turn on the air conditioner - response information for turning on the air conditioner".
In addition, the response action may be a control instruction realizing the semantic intent and/or a response voice. If the semantic intent is "turn on the air conditioner", the response action may be a control instruction controlling the air conditioner to turn on and/or a response voice fed back to the speaker, such as "the air conditioner has been turned on for you".
In this embodiment, the group category to which each speaker belongs can be determined according to that speaker's feature information, the corpus corresponding to that group category is matched, and the semantic intent the voice information is meant to express is accurately identified using the corresponding corpus, so that accurate recognition of the semantic intent is achieved.
EXAMPLE III
According to an embodiment of the present invention, there is also provided a voice interaction system, including:
the voice acquisition module is used for acquiring voice information;
the characteristic determining module is used for determining the characteristic information of a speaker who sends the voice information;
the group category determining module is used for determining the group category to which the speaker who sends the voice information belongs according to the characteristic information;
a corpus acquisition module, configured to acquire a corpus matched with the group category;
the semantic intention determining module is used for acquiring semantic intentions matched with the voice information from the corpus;
and the control module is used for controlling the intelligent equipment to execute the action responding to the semantic intention.
Example four
According to an embodiment of the present invention, there is also provided a storage medium having program code stored thereon, which when executed by a processor, implements the voice interaction method according to any one of the above embodiments.
EXAMPLE five
According to an embodiment of the present invention, there is also provided an electronic device, which includes a memory and a processor, where the memory stores program codes executable on the processor, and when the program codes are executed by the processor, the electronic device implements the voice interaction method according to any one of the above embodiments.
EXAMPLE six
According to an embodiment of the present invention, there is also provided a voice interaction system, including:
the client is used for acquiring voice information; and
a server comprising a memory, a processor, the memory having stored thereon program code executable on the processor, the program code implementing the voice interaction method as in any one of the above embodiments when executed by the processor.
The technical solution of the present invention has been described in detail above with reference to the drawings. In the related art, recognition of semantic intent is implemented using a generic speech training model, so most semantic understanding is not accurate enough. The invention provides a voice interaction method, system, storage medium and electronic device: the feature information of the speaker who utters the voice information is determined, the group category of the speaker is determined according to the feature information, a corpus matched with the group category is then acquired, and the semantic intent corresponding to the voice information is determined using that corpus, so as to control the smart device to execute the response action corresponding to the semantic intent. By accurately recognizing, with the corpus corresponding to the speaker's group category, the semantic intent the voice information is meant to express, accurate recognition of the semantic intent in the voice information can be achieved.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Although the embodiments of the present invention have been described above, the above description is only for the convenience of understanding the present invention, and is not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (11)

1. A method of voice interaction, comprising:
acquiring voice information;
determining characteristic information of a speaker who utters the voice information; wherein the characteristic information can be used for representing the group category to which the speaker belongs;
determining the group category to which the speaker who sends the voice information belongs according to the characteristic information;
obtaining a corpus matched with the group category;
obtaining semantic intentions matched with the voice information from the corpus;
controlling the smart device to perform an action responsive to the semantic intent.
2. The method of claim 1, wherein the corpus is created in advance by:
acquiring historical voice dialogue data of speakers belonging to the same group category, wherein the historical voice dialogue data comprises historical voice information and semantic intentions expressed by the historical voice information;
and performing statistical analysis on the historical voice dialogue data, determining from it the common linguistic features of the historical voice information used when speakers belonging to the group category express the same semantic intention, and establishing an association relationship between the common linguistic features and the semantic intentions corresponding to them, so as to construct the corpus.
3. The method of claim 2, wherein the common linguistic feature includes at least one of a multi-frequency word, a keyword, a sentence pattern, and a mood word.
4. The voice interaction method according to claim 3, wherein the step of constructing the corpus by statistically analyzing the historical voice dialogue data, determining common linguistic features of historical voice information used when speakers belonging to the group category express the same semantic intention from the historical voice dialogue data, and associating the common linguistic features with the semantic intention corresponding to the common linguistic features comprises the steps of:
when the common language features comprise multi-frequency words, determining the multi-frequency words in the historical voice information and semantic intentions expressed by the multi-frequency words, and establishing an association relationship between the multi-frequency words and the semantic intentions corresponding to the multi-frequency words so as to construct the corpus; the multi-frequency words are words with the occurrence frequency exceeding a preset threshold value;
when the common language features comprise keywords, selecting historical voice information expressing the same semantic intention from the historical voice dialogue data;
determining a keyword capable of expressing the semantic intention from the selected historical voice information, and establishing an association relationship between the keyword and the semantic intention corresponding to the keyword so as to construct the corpus;
when the common linguistic features comprise language sentence patterns, counting how many times each sentence pattern corresponding to the historical voice information expressing the same semantic intention occurs in the historical voice dialogue data, and associating the most frequently used sentence pattern with the semantic intention it corresponds to, so as to construct the corpus;
and when the common linguistic features comprise mood words, counting how many times each mood word is used in the historical voice information expressing the same semantic intention in the historical voice dialogue data, and establishing an association relationship between the most frequently used mood word and the semantic intention it corresponds to, so as to construct the corpus.
5. The voice interaction method according to claim 1, wherein the feature information includes at least one of age information, gender information, character information, and region information.
6. The voice interaction method according to claim 1 or 5, wherein determining feature information of a speaker who uttered the voice message comprises:
extracting voiceprint features from the voice information, and determining identity information of a speaker who sends the voice information based on the voiceprint features;
and determining the characteristic information of the speaker according to the identity information of the speaker.
7. The method of claim 5, wherein determining feature information of a speaker who uttered the voice message comprises:
when the feature information comprises age information and/or gender information, extracting sound spectrum features from the voice information, and determining the age information and/or gender information of a speaker who sends the voice information according to the sound spectrum features;
when the feature information comprises character information, determining the language expression style of the voice information, and determining the character type matched with the language expression style as character information of a speaker who sends the voice information according to the language expression style;
when the feature information comprises region information, extracting voice features from the voice information, and determining the region information of the speaker who uttered the voice information according to the voice features; wherein the voice features comprise at least one of accent, pronunciation and intonation.
8. A voice interaction system, comprising:
the voice acquisition module is used for acquiring voice information;
the characteristic determining module is used for determining the characteristic information of a speaker who sends the voice information; wherein the characteristic information can be used for representing the group category to which the speaker belongs;
the group category determining module is used for determining the group category to which the speaker who sends the voice information belongs according to the characteristic information;
a corpus acquisition module, configured to acquire a corpus matched with the group category;
the semantic intention determining module is used for acquiring semantic intentions matched with the voice information from the corpus;
and the control module is used for controlling the intelligent equipment to execute the action responding to the semantic intention.
9. A storage medium having program code stored thereon, wherein the program code, when executed by a processor, implements a voice interaction method as claimed in any one of claims 1 to 7.
10. An electronic device, characterized in that the electronic device comprises a memory, a processor, the memory having stored thereon program code executable on the processor, the program code, when executed by the processor, implementing the voice interaction method according to any one of claims 1 to 7.
11. A voice interaction system, comprising:
the client is used for acquiring voice information; and
a server comprising a memory, a processor, the memory having stored thereon program code executable on the processor, the program code implementing the method of voice interaction according to any one of claims 1 to 7 when executed by the processor.
CN202010546399.9A 2020-06-15 2020-06-15 Voice interaction method, system, storage medium and electronic equipment Pending CN113808575A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010546399.9A CN113808575A (en) 2020-06-15 2020-06-15 Voice interaction method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010546399.9A CN113808575A (en) 2020-06-15 2020-06-15 Voice interaction method, system, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113808575A true CN113808575A (en) 2021-12-17

Family

ID=78892495

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010546399.9A Pending CN113808575A (en) 2020-06-15 2020-06-15 Voice interaction method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113808575A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006098566A (en) * 2004-09-28 2006-04-13 Clarion Co Ltd Speech recognition system
CN104795067A (en) * 2014-01-20 2015-07-22 华为技术有限公司 Voice interaction method and device
CN105096938A (en) * 2015-06-30 2015-11-25 百度在线网络技术(北京)有限公司 Method and device for obtaining user characteristic information of user
CN107024931A (en) * 2016-01-29 2017-08-08 通用汽车环球科技运作有限责任公司 Speech recognition system and method for automatic Pilot
CN109346078A (en) * 2018-11-09 2019-02-15 泰康保险集团股份有限公司 Voice interactive method, device and electronic equipment, computer-readable medium
CN109545218A (en) * 2019-01-08 2019-03-29 广东小天才科技有限公司 A kind of audio recognition method and system
CN109920429A (en) * 2017-12-13 2019-06-21 上海擎感智能科技有限公司 It is a kind of for vehicle-mounted voice recognition data processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination