WO2021081744A1 - Voice information processing method, apparatus, device, and storage medium - Google Patents
Voice information processing method, apparatus, device, and storage medium
- Publication number
- WO2021081744A1 (PCT/CN2019/113943)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- target
- voice information
- skill
- role
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Definitions
- This application relates to voice technology, and in particular to a voice information processing method, device, equipment, and storage medium.
- Smart voice assistants have been widely used in mobile phones, vehicle terminals, smart homes, and other products, freeing users' hands. Users only need to interact with the smart voice assistant through voice to control and operate the product functions.
- However, the text-to-speech (TTS) component in existing smart voice solutions can only provide a role with a single timbre;
- a single timbre lacks an interesting and anthropomorphic interaction process.
- the embodiments of the present application expect to provide a voice information processing method, device, equipment, and storage medium.
- a voice information processing method including:
- acquiring voice information collected by a voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate invocation of the target skill;
- the first mapping relationship includes mapping relationships between at least three roles and skills
- Before controlling the first target role to perform voice broadcast for the target skill, the method further includes: determining a second target role currently performing voice broadcast; and, when the second target role is different from the first target role, switching the second target role currently performing voice broadcast to the first target role.
- the voice information further includes second voice information
- the second voice information is used to instruct waking up the second target role; before determining the second target role currently performing voice broadcast, the method further includes: recognizing the second voice information from the voice information, and determining the second target role whose wake-up is indicated by the second voice information; and controlling the second target role to perform voice broadcast.
- Determining the second target role whose wake-up is indicated by the second voice information includes: determining a wake-up identifier in the second voice information; and determining, from a preset second mapping relationship, the second target role corresponding to the wake-up identifier; wherein the second mapping relationship includes mapping relationships between at least three roles and wake-up identifiers.
- Controlling the first target role to perform voice broadcast for the target skill includes: acquiring timbre information of the first target role and voice text information, wherein different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
- the method further includes: acquiring third voice information, wherein the third voice information is used to instruct exiting the first target role currently performing voice broadcast; and, based on the third voice information, controlling exit from the first target role.
- Controlling exit from the first target role based on the third voice information includes: determining an exit identifier in the third voice information; and determining, from a preset third mapping relationship, the first target role corresponding to the exit identifier.
- a voice information processing device including:
- the acquiring part is configured to acquire the voice information collected by the voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
- the processing part is configured to recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to call;
- the processing part is configured to determine a first target role corresponding to the target skill from a preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
- the control part is configured to control the first target character to perform voice broadcast for the target skill.
- a voice information processing device including: a processor and a memory configured to store a computer program that can run on the processor,
- the processor is configured to execute the steps of any one of the aforementioned methods when running the computer program.
- a computer-readable storage medium is provided, and a computer program is stored thereon, and when the computer program is executed by a processor, the steps of the method described in any one of the foregoing are implemented.
- The voice information processing method, device, equipment, and storage medium acquire the voice information collected by the voice collection unit, wherein the voice information includes first voice information used to indicate invocation of the target skill; recognize, based on a preset skill recognition strategy, the target skill indicated by the first voice information; determine the first target role that realizes the target skill; and control the first target role to perform voice broadcast for the target skill. The user's intention expressed in the voice information is determined through intention judgment, and the target skill the user wants to call is determined according to that intention, so as to wake up the target role corresponding to the target skill.
- In this way, role wake-up can be realized more smoothly, the intelligence of voice control is improved, and multiple roles can be configured to perform voice broadcast of different skills, which makes voice control more engaging.
- FIG. 1 is a schematic diagram of the first flow of a voice information processing method in an embodiment of this application
- FIG. 2 is a schematic diagram of a second flow of a voice information processing method in an embodiment of this application;
- FIG. 3 is a schematic diagram of the third process of the voice information processing method in an embodiment of this application.
- FIG. 4 is a schematic diagram of the composition structure of a voice processing system in an embodiment of the application.
- FIG. 5 is a schematic diagram of the composition structure of the skill processing system in an embodiment of the application.
- FIG. 6 is a schematic diagram of the composition structure of a voice information processing device in an embodiment of the application.
- FIG. 7 is a schematic diagram of the composition structure of a voice information processing device in an embodiment of the application.
- FIG. 1 is a schematic diagram of the first flow of the voice information processing method in the embodiment of the present application. As shown in FIG. 1, the method may specifically include:
- Step 101 Acquire voice information collected by a voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
- Step 102 Recognizing the first voice information based on a preset skill recognition strategy, and determining that the first voice information indicates the target skill to be invoked;
- Step 103 Determine the first target role corresponding to the target skill from the preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
- Step 104 Control the first target character to perform voice broadcast for the target skill.
- the execution subject of step 101 to step 104 may be the processor of the voice information processing device.
- the voice information processing device may be located on the server side or the terminal side.
- the terminal can be a mobile terminal or a fixed terminal with a voice control function.
- For example, the terminal may be a smart phone, a personal computer (such as a tablet, desktop computer, notebook, netbook, or handheld computer), an e-book reader, a portable multimedia player, an audio/video player, a camera, a virtual reality device, a wearable device, etc.
- the voice collection device may be a microphone.
- For example, a microphone located on a terminal collects voice information, and the steps of the above voice information processing method are executed locally on the terminal; alternatively, the voice information is uploaded to a server, the server executes the steps of the above voice information processing method and delivers the processing result to the terminal,
- and the terminal executes the corresponding voice output control operation according to the processing result.
- the preset skill recognition strategy is used for skill recognition to determine the target skill that the user needs to control the terminal or other electronic equipment through the voice information.
- the target skill is the skill determined according to the user's intention expressed by the first voice information.
- the first voice message is "What's the weather today?"
- the user's intention is to query the weather, so the corresponding target skill is determined to be "weather query", and the query date is "today".
- The user does not need to say a control keyword such as "check the weather", but uses more colloquial language to realize the voice control function; the control process is more intelligent and conforms to the user's daily communication habits.
- Recognizing the first voice information based on the preset skill recognition strategy includes: using voice recognition technology to perform text recognition on the first voice information to obtain first text information; and using semantic recognition technology to perform semantic recognition on the first text information to obtain the target skill that the first voice information indicates to call.
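The two-stage recognition above (speech to text, then semantic recognition of the text) can be sketched as follows. This is a minimal illustration only: a keyword table stands in for a real ASR engine plus semantic classifier, and all names are assumptions rather than anything specified in this application.

```python
# Illustrative stand-in for the skill recognition strategy: a real system
# would use an ASR engine for text recognition and an NLU model for
# semantic recognition; a keyword lookup substitutes for both here.

SKILL_KEYWORDS = {
    "weather": "weather_query",
    "music": "music_playback",
    "news": "information_query",
}

def recognize_target_skill(first_text_info):
    """Return the target skill the recognized text indicates, or None."""
    lowered = first_text_info.lower()
    for keyword, skill in SKILL_KEYWORDS.items():
        if keyword in lowered:
            return skill
    return None
```

For the example above, "What's the weather today?" would map to the "weather query" skill without the user saying an explicit control keyword.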
- the method further includes: acquiring at least one skill that can be realized by the at least three characters; and establishing a first mapping relationship by using the mapping relationship between the at least three characters and the at least one skill.
- At least one skill that can be achieved by at least three roles is acquired; at least one skill that can be achieved by each role is used to establish a skill set; and the mapping relationship between at least three roles and the skill set is used to establish a first mapping relationship.
- the first mapping relationship one type of role corresponds to at least one type of skill, and all skills corresponding to one type of role form a skill set.
- the first mapping relationship may include the mapping relationship between roles and skills, or the mapping relationship between roles and skill sets.
- the first target role corresponding to the target skill may be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set may be determined from the first mapping relationship.
- the voice role configured on the terminal can be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer.
- The third-party role can be called by downloading a third-party application, or, without downloading a third-party application,
- it can be called through online access.
- For example, role A corresponds to skill set A,
- and the skills included in set A are "weather query, radio broadcasting, audio e-book broadcasting, etc.";
- role B corresponds to skill set B, and the skills included in set B are "music playback, music video playback, music anchor live broadcast, etc.";
- role C corresponds to skill set C, and the skills included in set C are "information query, information recommendation, information download, etc.".
- Role A above can be a role developed by the terminal manufacturer to realize voice control of its own application A, while role B and role C can be roles developed by other manufacturers to realize voice control of application B and application C.
- multiple roles are configured to perform voice broadcasts of different skills, which improves the interest of voice control.
- The first target role corresponding to the target skill is determined from the preset first mapping relationship, wherein the first mapping relationship includes mapping relationships between at least three roles and skills. Specifically, the target skill is matched against the skills in the first mapping relationship, and the corresponding first target role is determined when the matching succeeds; alternatively, the target skill is matched against the skill sets corresponding to the roles, the skill set containing the target skill is determined, and the first target role corresponding to that skill set is thereby determined.
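A minimal sketch of that lookup, assuming the first mapping relationship is stored role-by-role as skill sets; the role and skill names mirror the examples above and are illustrative only:

```python
# First mapping relationship: each role maps to its skill set. The first
# target role is the role whose skill set contains the target skill.

FIRST_MAPPING = {
    "role_A": {"weather_query", "radio_broadcast", "audio_ebook"},
    "role_B": {"music_playback", "music_video", "anchor_live"},
    "role_C": {"information_query", "information_recommend", "information_download"},
}

def find_first_target_role(target_skill):
    """Return the role whose skill set contains the target skill, or None."""
    for role, skill_set in FIRST_MAPPING.items():
        if target_skill in skill_set:
            return role
    return None
```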
- controlling the first target character to perform voice broadcast for the target skill includes: acquiring timbre information and voice text information of the first target character; wherein different roles correspond to different timbre information; The timbre information of the target character and the voice text information are synthesized into voice audio information; the voice output unit is controlled to output the voice audio information.
- That is, the voice text to be output is synthesized with the timbre of the first target role into voice audio information carrying that timbre, and the voice audio information is output through a voice output unit such as a speaker or earphone.
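The per-role timbre selection can be sketched as below. The timbre table and the returned structure are assumptions; a real implementation would call a TTS engine that accepts a voice or timbre parameter.

```python
# Different roles correspond to different timbre information; synthesis
# combines the first target role's timbre with the voice text. The dict
# returned here stands in for the synthesized voice audio information.

TIMBRE_BY_ROLE = {
    "role_A": "timbre_a",
    "role_B": "timbre_b",
    "role_C": "timbre_c",
}

def synthesize_broadcast(role, voice_text):
    """Combine a role's timbre with the voice text for output."""
    timbre = TIMBRE_BY_ROLE[role]
    return {"timbre": timbre, "text": voice_text}
```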
- The voice information processing method, device, equipment, and storage medium acquire the voice information collected by the voice collection unit, wherein the voice information includes first voice information used to indicate invocation of the target skill; recognize, based on a preset skill recognition strategy, the target skill indicated by the first voice information; determine the first target role that realizes the target skill; and control the first target role to perform voice broadcast for the target skill. The user's intention expressed in the voice information is determined through intention judgment, and the target skill the user wants to call is determined according to that intention, so as to wake up the target role corresponding to the target skill. In this way, role wake-up can be realized more smoothly, the intelligence of voice control is improved, and multiple roles can be configured to perform voice broadcast of different skills, which makes voice control more engaging.
- FIG. 2 is a schematic diagram of the second flow of the method for processing voice information in an embodiment of this application. As shown in FIG. 2, the method includes:
- Step 201 Acquire voice information collected by a voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
- the voice collection device may be a microphone.
- For example, a microphone located on a terminal collects voice information, and the steps of the above voice information processing method are executed locally on the terminal; alternatively, the voice information is uploaded to a server, the server executes the steps of the above voice information processing method and delivers the processing result to the terminal,
- and the terminal executes the corresponding voice output control operation according to the processing result.
- Step 202 Recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to call;
- the preset skill recognition strategy is used for skill recognition to determine the target skill that the user needs to control the terminal or other electronic equipment through the voice information.
- the target skill is the skill determined according to the user's intention expressed by the first voice information.
- the first voice message is "What's the weather today?"
- the user's intention is to query the weather, so the corresponding target skill is determined to be "weather query", and the query date is "today".
- The user does not need to say a control keyword such as "check the weather", but uses more colloquial language to realize the voice control function; the control process is more intelligent and conforms to the user's daily communication habits.
- Recognizing the first voice information based on the preset skill recognition strategy includes: using voice recognition technology to perform text recognition on the first voice information to obtain first text information; and using semantic recognition technology to perform semantic recognition on the first text information to obtain the target skill that the first voice information indicates to call.
- Step 203 Determine the first target role corresponding to the target skill from the preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
- the first mapping relationship may include a mapping relationship between a role and a skill, or a mapping relationship between a role and a skill set.
- the first target role corresponding to the target skill can be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set is determined from the first mapping relationship.
- step 203 may specifically include: matching the target skill against the skills in the first mapping relationship to determine the corresponding first target role when the matching succeeds; or matching the target skill against the skill sets corresponding to the roles to determine the skill set containing the target skill, and thereby determining the first target role corresponding to that skill set.
- the voice role configured on the terminal can be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer.
- The third-party role can be called by downloading a third-party application, or, without downloading a third-party application,
- it can be called through online access.
- Step 204 Determine the second target role currently performing voice broadcast
- the second target character is a character that is performing voice broadcast before awakening the first target character.
- When the terminal device is using the second target role for voice communication with the user, and it is judged from the user's speaking intention that the user needs the first target role to perform the target skill, the first target role can be summoned through the second target role.
- Determining the second target role currently performing voice broadcast includes: detecting the role currently performing voice broadcast, and taking it as the second target role. For example, the role identification bit of the role currently performing voice broadcast is detected to determine the second target role.
- Initially, a role needs to be awakened to communicate with the user: for example, the user directly wakes up the first target role corresponding to the target skill, wakes up the system default role, or wakes up a role the user often uses.
- The role may be awakened by voice.
- the voice information further includes second voice information, and the second voice information is used to instruct to wake up the second target character; accordingly, the determining that the voice is currently being performed Before broadcasting the second target character, the method further includes: recognizing the second voice information from the voice information, and determining the second target character whose wakeup is indicated by the second voice information; controlling the second target The role performs voice broadcast.
- Determining the second target role whose wake-up is indicated by the second voice information includes: determining the wake-up identifier in the second voice information; and determining, from the preset second mapping relationship, the second target role corresponding to the wake-up identifier.
- In the second mapping relationship, roles and wake-up identifiers are in one-to-one or one-to-many mapping; that is, a role can be awakened by only one wake-up identifier, or a role can be awakened by multiple wake-up identifiers.
- The wake-up identifiers for different roles can be uniformly specified by the manufacturer, or set by the user according to habits or preferences.
- For example, the wake-up identifiers corresponding to role A are "Classmate A", "Hello little A", and "Foodie little A";
- the wake-up identifier corresponding to role B is "Fat classmate B";
- the wake-up identifiers corresponding to role C are "Hello old C" and "Hey old C".
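The second mapping relationship (one role to one or more wake-up identifiers) can be sketched as below; the wake words mirror the examples above and are illustrative only:

```python
# Second mapping relationship: each role maps to one or more wake-up
# identifiers. Matching any identifier in the text of the second voice
# information determines the second target role to wake up.

SECOND_MAPPING = {
    "role_A": ["classmate a", "hello little a", "foodie little a"],
    "role_B": ["fat classmate b"],
    "role_C": ["hello old c", "hey old c"],
}

def find_wakeup_role(second_text_info):
    """Return the role whose wake-up identifier appears in the text, or None."""
    lowered = second_text_info.lower()
    for role, wake_ids in SECOND_MAPPING.items():
        if any(wake_id in lowered for wake_id in wake_ids):
            return role
    return None
```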
- Acquiring the voice information collected by the voice collection unit includes: acquiring the first voice information and the second voice information simultaneously; or first acquiring the second voice information, determining the second target role that the second voice information indicates to wake up, and controlling the second target role to perform voice broadcast, and then acquiring the first voice information and determining the target skill that the first voice information indicates to call.
- Step 205 Determine whether the second target role is the same as the first target role, if not, go to step 206; if yes, go to step 207;
- Step 206 When the second target role is different from the first target role, switch the second target role currently performing voice broadcast to the first target role;
- When the second target role is different from the first target role, the method further includes: generating switching prompt information; and controlling the first target role or the second target role to play the switching prompt information.
- the switching prompt information is used to remind the user that the role switching operation will be performed or that the role switching operation has been performed.
- Controlling the first target role or the second target role to play the switching prompt information includes: before the switch, controlling the second target role to play the switching prompt information; or, after the switch, controlling the first target role to play the switching prompt information.
- the switching prompt message can be played before the switching operation to remind the user that the second target character is about to summon the first target character to perform voice broadcast.
- For example, the voice assistant of role A is awakened by role A's wake word, but the skill classifier determines that the target skill actually belongs to role B. A smoothing sentence can be added and broadcast in role A's voice, such as "Questions about XX can be asked to role B", before the actual result from role B is fed back.
- the switching prompt message may be played after the switching operation to remind the user that the voice broadcast is now performed by the first target role.
- For example, role A's wake word wakes up role A's voice assistant, and the skill classifier determines that the target skill is actually a skill of role B.
- closed-loop dialogues are allowed between different roles, making role switching smoother, more in line with users' daily communication habits, and enhancing the intelligence level of voice control.
- step 207 is also executed after the switching is completed.
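Steps 205 to 207 can be sketched as the following hand-off routine. The prompt wording and event strings are illustrative assumptions, not text from this application:

```python
# If the role currently broadcasting (second target role) differs from the
# first target role, play a switching prompt and switch; either way, the
# first target role ends up performing the voice broadcast (step 207).

def switch_if_needed(current_role, first_target_role):
    """Return the ordered broadcast/switch events for this interaction."""
    events = []
    if current_role != first_target_role:
        # Before the switch, the current role plays the switching prompt.
        events.append(f"{current_role}: prompt - handing over to {first_target_role}")
        events.append(f"switch: {current_role} -> {first_target_role}")
    events.append(f"{first_target_role}: voice broadcast for the target skill")
    return events
```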
- Step 207 Control the first target character to perform voice broadcast for the target skill.
- When the second target role is the same as the first target role, controlling the first target role to perform voice broadcast amounts to controlling the second target role to continue the voice broadcast.
- This step may specifically include: acquiring timbre information and voice text information of the first target character; wherein different roles correspond to different timbre information; synthesizing voice and audio information based on the timbre information of the first target character and the voice text information ; Control the voice output unit to output the voice audio information.
- That is, the voice text to be output is synthesized with the timbre of the first target role into voice audio information carrying that timbre, and the voice audio information is output through a voice output unit such as a speaker or earphone.
- FIG. 3 is a schematic diagram of the third process of the method for processing voice information in an embodiment of this application. As shown in FIG. 3, the method includes:
- Step 301 Acquire voice information collected by a voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
- the voice collection device may be a microphone.
- For example, a microphone located on a terminal collects voice information, and the steps of the above voice information processing method are executed locally on the terminal; alternatively, the voice information is uploaded to a server, the server executes the steps of the above voice information processing method and delivers the processing result to the terminal,
- and the terminal executes the corresponding voice output control operation according to the processing result.
- Step 302 Recognizing the first voice information based on a preset skill recognition strategy, and determining that the first voice information indicates the target skill to be invoked;
- the preset skill recognition strategy is used for skill recognition to determine the target skill that the user needs to control the terminal or other electronic equipment through the voice information.
- the target skill is the skill determined according to the user's intention expressed by the first voice information.
- the first voice message is "What's the weather today?"
- the user's intention is to query the weather, so the corresponding target skill is determined to be "weather query", and the query date is "today".
- The user does not need to say a control keyword such as "check the weather", but uses more colloquial language to realize the voice control function; the control process is more intelligent and conforms to the user's daily communication habits.
- Recognizing the first voice information based on the preset skill recognition strategy includes: using voice recognition technology to perform text recognition on the first voice information to obtain first text information; and using semantic recognition technology to perform semantic recognition on the first text information to obtain the target skill that the first voice information indicates to call.
- Step 303 Determine the first target role corresponding to the target skill from the preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
- the first mapping relationship may include a mapping relationship between a role and a skill, or a mapping relationship between a role and a skill set.
- the first target role corresponding to the target skill may be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set may be determined from the first mapping relationship.
- step 303 may specifically include: matching the target skill against the skills in the first mapping relationship to determine the corresponding first target role when the matching succeeds; or matching the target skill against the skill sets corresponding to the roles to determine the skill set containing the target skill, and thereby determining the first target role corresponding to that skill set.
- the voice role configured on the terminal can be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer.
- The third-party role can be called by downloading a third-party application, or, without downloading a third-party application,
- it can be called through online access; calling third-party roles expands the range of available voice skills and improves the processing of the user's voice information.
- Step 304 Control the first target character to perform voice broadcast for the target skill
- This step may specifically include: acquiring timbre information and voice text information of the first target character; wherein different roles correspond to different timbre information; synthesizing voice and audio information based on the timbre information of the first target character and the voice text information ; Control the voice output unit to output the voice audio information.
- That is, the voice text to be output is synthesized with the timbre of the first target role into voice audio information carrying that timbre, and the voice audio information is output through a voice output unit such as a speaker or earphone.
- Step 305: Acquire third voice information, where the third voice information is used to indicate exiting the first target role currently performing the voice broadcast.
- Step 306: Based on the third voice information, control exiting the first target role.
- This step specifically includes: determining the exit identifier in the third voice information; determining the first target role corresponding to the exit identifier from a preset third mapping relationship table, where the third mapping relationship includes mapping relationships between at least three roles and exit identifiers; and controlling the first target role to exit.
- The roles and exit identifiers in the third mapping relationship may be mapped one-to-one or one-to-many; that is, a role may be exited by only a single exit identifier, or by any of several exit identifiers.
- The exit identifiers for different roles may be uniformly specified by the manufacturer, or set by the user according to habit or preference.
- For example, the exit identifiers corresponding to role A are "Exit classmate A", "Little A, retreat", and "Walk away, foodie little A"; the exit identifier corresponding to role B is "Go away, fat student B"; and the exit identifiers corresponding to role C are "Exit old C" and "Goodbye, old C".
- Associating different exit identifiers with different roles improves the flexibility of role control.
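A minimal sketch of this one-to-many exit lookup, using simplified English renderings of the example phrases; the data structure and matching rule are assumptions, not the application's actual interface:

```python
from typing import Optional

# Hypothetical third mapping relationship: one role, one or more exit
# identifiers (one-to-one for role B, one-to-many for roles A and C).
EXIT_IDENTIFIERS = {
    "role_a": ("exit classmate a", "little a retreat", "walk away foodie little a"),
    "role_b": ("go away fat student b",),
    "role_c": ("exit old c", "goodbye old c"),
}

def role_to_exit(third_voice_text: str) -> Optional[str]:
    """Return the role whose exit identifier appears in the recognized text."""
    text = third_voice_text.lower()
    for role, phrases in EXIT_IDENTIFIERS.items():
        if any(phrase in text for phrase in phrases):  # any identifier matches
            return role
    return None  # no exit identifier found; nothing to exit

print(role_to_exit("Goodbye old C"))  # -> role_c
```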
- FIG. 4 is a schematic diagram of the composition of the voice processing system in an embodiment of the application.
- As shown in FIG. 4, the voice processing system includes: a voice assistant client 401, a voice assistant central control server 402, a recognition server 403, a skill classifier 404, a role A server 405, a role B server 406, and a role C server 407.
- The voice assistant client 401 implements, on the user side, the collection and uploading of voice information (including audio and a role wake-up word), the receiving of voice output results, and the outputting of voice information in different roles.
- In the voice information, the audio is the first voice information and the role wake-up word is the second voice information.
- The voice assistant central control server 402 through the role C server 407 are used to process the voice data.
- Specifically, the voice assistant central control server 402 is configured to receive the voice information uploaded by the voice assistant client 401, perform text recognition on the voice information using voice recognition technology to obtain recognized text, and send the recognized text to the skill classifier 404.
- The skill classifier 404 uses semantic recognition technology to perform semantic understanding on the recognized text and determine the target skill.
- When the target skill is skill A, the role A server 405 is used to perform the skill A service; when the target skill is skill B, the role B server 406 is used to perform the skill B service; when the target skill is skill C, the role C server 407 is used to perform the skill C service.
- The role A server 405 processes skill A and sends the obtained skill A intent result, skill A resource service result, response text, and role A response audio to the voice assistant central control server 402;
- the role B server 406 processes skill B and sends the obtained skill B intent result, skill B resource service result, response text, and role B response audio to the voice assistant central control server 402;
- the role C server 407 processes skill C and sends the obtained skill C intent result, skill C resource service result, response text, and role C response audio to the voice assistant central control server 402.
- The voice assistant central control server 402 performs speech synthesis according to the received processing results to generate a voice output result and sends it to the voice assistant client 401; the voice assistant client 401 controls the outputting of the voice output result.
- FIG. 5 is a schematic diagram of the composition of the skill processing system in an embodiment of the application. As shown in FIG. 5, the system includes: a skill server 501, a semantic understanding server 502, a resource recall server 503, and a TTS server 504.
- The skill server 501 sends the received recognized text to the semantic understanding server 502; the semantic understanding server 502 performs semantic understanding on the recognized text, obtains the user's intent result, and returns the intent result to the skill server 501.
- The skill server 501 sends the intent result to the resource recall server 503; the resource recall server 503 determines the resource service result and response text according to the intent result and sends them to the skill server 501.
- The skill server 501 then sends the response text to the TTS server 504; the TTS server 504 performs speech synthesis according to the role's timbre and the response text to generate response audio, and returns the response audio to the skill server 501.
- Finally, the skill server 501 sends the obtained intent result, resource service result, response text, and response audio to the voice assistant central control server.
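The flow through these servers can be sketched with stand-in functions, one per server; all function names, return shapes, and values are illustrative assumptions, not the application's actual interfaces:

```python
def semantic_understanding(recognized_text: str) -> dict:
    # Stand-in for the semantic understanding server 502.
    return {"intent": "query", "text": recognized_text}

def resource_recall(intent_result: dict):
    # Stand-in for the resource recall server 503: pick a resource
    # service result and a response text for the intent.
    resource = {"resource": "stub_result"}
    response_text = f"Here is the result for: {intent_result['text']}"
    return resource, response_text

def tts(response_text: str, timbre: str) -> bytes:
    # Stand-in for the TTS server 504: synthesize in the role's timbre.
    return f"[{timbre}] {response_text}".encode("utf-8")

def skill_server(recognized_text: str, timbre: str = "role_a_voice"):
    """Mirror the skill server 501: understanding -> recall -> TTS."""
    intent_result = semantic_understanding(recognized_text)
    resource_result, response_text = resource_recall(intent_result)
    response_audio = tts(response_text, timbre)
    # All four results go back to the voice assistant central control server.
    return intent_result, resource_result, response_text, response_audio

print(skill_server("weather in Beijing")[2])
# -> Here is the result for: weather in Beijing
```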
- the voice assistant client is located on the terminal side, and the terminal side further includes a voice collection unit for collecting voice data; other servers that implement voice information processing are located on the server side.
- part or all of the servers that implement voice information processing may also be located on the terminal side.
- An embodiment of the present application also provides a voice information processing device. As shown in FIG. 6, the device includes:
- the acquiring part 601 is configured to acquire voice information collected by a voice collecting unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
- the processing part 602 is configured to recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to call;
- the processing part 602 is configured to determine a first target role corresponding to the target skill from a preset first mapping relationship; wherein, the first mapping relationship includes at least three mapping relationships between roles and skills;
- the control part 603 is configured to control the first target character to perform voice broadcast for the target skill.
- Before controlling the first target role to perform the voice broadcast for the target skill, the processing part is configured to determine the second target role currently performing the voice broadcast;
- the device further includes: a switching part configured to switch the second target role currently performing voice broadcast to the first target role when the second target role is different from the first target role.
- The voice information further includes second voice information, where the second voice information is used to instruct waking up the second target role; the processing part is configured to recognize the second voice information from the voice information, determine the second target role that the second voice information indicates to wake up, and control the second target role to perform the voice broadcast.
- Specifically, the processing part is configured to determine the wake-up identifier in the second voice information and determine the second target role corresponding to the wake-up identifier from the preset second mapping relationship, where the second mapping relationship includes mapping relationships between at least three roles and wake-up identifiers.
- The control part is configured to acquire the timbre information and voice text information of the first target role, where different roles correspond to different timbre information; synthesize voice audio information based on the timbre information of the first target role and the voice text information; and control the voice output unit to output the voice audio information.
- The acquiring part is configured to acquire third voice information, where the third voice information is used to indicate exiting the first target role currently performing the voice broadcast; the control part is configured to control exiting the first target role based on the third voice information.
- The processing part is configured to determine the exit identifier in the third voice information and determine, from a preset third mapping relationship table, the first target role corresponding to the exit identifier, where the third mapping relationship includes mapping relationships between at least three roles and exit identifiers; the control part is configured to control the first target role to exit.
- the embodiment of the present application also provides a voice information processing device.
- As shown in FIG. 7, the device includes: a processor 701 and a memory 702 configured to store a computer program that can run on the processor; when the processor 701 executes the computer program stored in the memory 702, the steps of the method in the foregoing embodiments are implemented.
- A bus system 703 is used to implement connection and communication between these components.
- In addition to a data bus, the bus system 703 includes a power bus, a control bus, and a status signal bus.
- For clarity, however, the various buses are all labeled as the bus system 703 in FIG. 7.
- An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method described in any of the foregoing embodiments are implemented.
- The above-mentioned processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, and a microprocessor.
- The aforementioned memory may be a volatile memory, such as a random-access memory (RAM); a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of these types of memory, and it provides instructions and data to the processor.
Claims (10)
- A voice information processing method, comprising: acquiring voice information collected by a voice collection unit, wherein the voice information includes first voice information, and the first voice information is used to indicate invocation of a target skill; recognizing the first voice information based on a preset skill recognition strategy, and determining the target skill that the first voice information indicates to invoke; determining, from a preset first mapping relationship, a first target role corresponding to the target skill, wherein the first mapping relationship includes mapping relationships between at least three roles and skills; and controlling the first target role to perform a voice broadcast for the target skill.
- The method according to claim 1, wherein before controlling the first target role to perform the voice broadcast for the target skill, the method further comprises: determining a second target role currently performing a voice broadcast; and when the second target role is different from the first target role, switching the second target role currently performing the voice broadcast to the first target role.
- The method according to claim 2, wherein the voice information further includes second voice information, and the second voice information is used to indicate waking up the second target role; before determining the second target role currently performing the voice broadcast, the method further comprises: recognizing the second voice information from the voice information, and determining the second target role that the second voice information indicates to wake up; and controlling the second target role to perform a voice broadcast.
- The method according to claim 3, wherein determining the second target role that the second voice information indicates to wake up comprises: determining a wake-up identifier in the second voice information; and determining, from a preset second mapping relationship, the second target role corresponding to the wake-up identifier, wherein the second mapping relationship includes mapping relationships between at least three roles and wake-up identifiers.
- The method according to claim 1, wherein controlling the first target role to perform the voice broadcast for the target skill comprises: acquiring timbre information and voice text information of the first target role, wherein different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
- The method according to any one of claims 1 to 5, wherein after controlling the first target role to perform the voice broadcast for the target skill, the method further comprises: acquiring third voice information, wherein the third voice information is used to indicate exiting the first target role currently performing the voice broadcast; and controlling exiting of the first target role based on the third voice information.
- The method according to claim 6, wherein controlling exiting of the first target role based on the third voice information comprises: determining an exit identifier in the third voice information; determining, from a preset third mapping relationship table, the first target role corresponding to the exit identifier, wherein the third mapping relationship includes mapping relationships between at least three roles and exit identifiers; and controlling the first target role to exit.
- A voice information processing apparatus, comprising: an acquiring part configured to acquire voice information collected by a voice collection unit, wherein the voice information includes first voice information, and the first voice information is used to indicate invocation of a target skill; a processing part configured to recognize the first voice information based on a preset skill recognition strategy and determine the target skill that the first voice information indicates to invoke, the processing part being further configured to determine, from a preset first mapping relationship, a first target role corresponding to the target skill, wherein the first mapping relationship includes mapping relationships between at least three roles and skills; and a control part configured to control the first target role to perform a voice broadcast for the target skill.
- A voice information processing device, comprising: a processor and a memory configured to store a computer program executable on the processor, wherein the processor is configured to perform the steps of the method according to any one of claims 1 to 7 when running the computer program.
- A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201980099978.9A CN114391165A (zh) | 2019-10-29 | 2019-10-29 | 语音信息处理方法、装置、设备及存储介质 |
PCT/CN2019/113943 WO2021081744A1 (zh) | 2019-10-29 | 2019-10-29 | 语音信息处理方法、装置、设备及存储介质 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/113943 WO2021081744A1 (zh) | 2019-10-29 | 2019-10-29 | 语音信息处理方法、装置、设备及存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021081744A1 true WO2021081744A1 (zh) | 2021-05-06 |
Family
ID=75714767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/113943 WO2021081744A1 (zh) | 2019-10-29 | 2019-10-29 | 语音信息处理方法、装置、设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114391165A (zh) |
WO (1) | WO2021081744A1 (zh) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104464716A (zh) * | 2014-11-20 | 2015-03-25 | 北京云知声信息技术有限公司 | 一种语音播报系统和方法 |
US20170083285A1 (en) * | 2015-09-21 | 2017-03-23 | Amazon Technologies, Inc. | Device selection for providing a response |
CN107122179A (zh) * | 2017-03-31 | 2017-09-01 | 阿里巴巴集团控股有限公司 | 语音的功能控制方法和装置 |
CN108735211A (zh) * | 2018-05-16 | 2018-11-02 | 智车优行科技(北京)有限公司 | 语音处理方法、装置、车辆、电子设备、程序及介质 |
CN109524010A (zh) * | 2018-12-24 | 2019-03-26 | 出门问问信息科技有限公司 | 一种语音控制方法、装置、设备及存储介质 |
2019
- 2019-10-29 WO PCT/CN2019/113943 patent/WO2021081744A1/zh active Application Filing
- 2019-10-29 CN CN201980099978.9A patent/CN114391165A/zh active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022247825A1 (zh) * | 2021-05-24 | 2022-12-01 | 维沃移动通信有限公司 | 信息播报方法和电子设备 |
CN115001890A (zh) * | 2022-05-31 | 2022-09-02 | 四川虹美智能科技有限公司 | 基于免应答的智能家电控制方法及装置 |
CN115001890B (zh) * | 2022-05-31 | 2023-10-31 | 四川虹美智能科技有限公司 | 基于免应答的智能家电控制方法及装置 |
Also Published As
Publication number | Publication date |
---|---|
CN114391165A (zh) | 2022-04-22 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19950862; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 19950862; Country of ref document: EP; Kind code of ref document: A1
| 32PN | Ep: public notification in the ep bulletin as the address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.10.2022)