WO2021081744A1 - Voice information processing method, apparatus, device, and storage medium - Google Patents

Voice information processing method, apparatus, device, and storage medium

Info

Publication number
WO2021081744A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
target
voice information
skill
role
Prior art date
Application number
PCT/CN2019/113943
Other languages
English (en)
French (fr)
Inventor
郝杰
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to CN201980099978.9A (published as CN114391165A)
Priority to PCT/CN2019/113943
Publication of WO2021081744A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • This application relates to voice technology, and in particular to a voice information processing method, device, equipment, and storage medium.
  • Smart voice assistants have been widely used in mobile phones, vehicle terminals, smart homes, and other products, freeing users' hands: users only need to interact with the smart voice assistant through voice to control and operate the product functions.
  • However, the text-to-speech (TTS) in existing smart voice solutions can only provide a role with a single timbre.
  • Because the timbre is single, the interactive process lacks interest and anthropomorphism.
  • In view of this, the embodiments of the present application provide a voice information processing method, device, equipment, and storage medium.
  • a voice information processing method including:
  • acquiring voice information collected by a voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
  • recognizing the first voice information based on a preset skill recognition strategy, and determining the target skill that the first voice information indicates to call;
  • determining a first target role corresponding to the target skill from a preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills; and
  • controlling the first target role to perform voice broadcast for the target skill.
  • Before the controlling the first target role to perform voice broadcast for the target skill, the method further includes: determining a second target role currently performing voice broadcast; and, when the second target role is different from the first target role, switching the second target role currently performing the voice broadcast to the first target role.
  • The voice information further includes second voice information, and the second voice information is used to instruct to wake up the second target role; before determining the second target role currently performing voice broadcast, the method further includes: recognizing the second voice information from the voice information, and determining the second target role whose wake-up is indicated by the second voice information; and controlling the second target role to perform voice broadcast.
  • The determining the second target role whose wake-up is indicated by the second voice information includes: determining the wake-up identifier in the second voice information; and determining, from a preset second mapping relationship, the second target role corresponding to the wake-up identifier; wherein the second mapping relationship includes mapping relationships between at least three roles and wake-up identifiers.
  • The controlling the first target role to perform the voice broadcast for the target skill includes: acquiring the timbre information of the first target role and the voice text information, wherein different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling the voice output unit to output the voice audio information.
  • The method further includes: acquiring third voice information, wherein the third voice information is used to instruct to exit the first target role currently performing the voice broadcast; and, based on the third voice information, controlling exit from the first target role.
  • The controlling exit from the first target role based on the third voice information includes: determining an exit identifier in the third voice information; determining, from a preset third mapping relationship, the first target role corresponding to the exit identifier; and controlling the first target role to exit.
  • a voice information processing device including:
  • the acquiring part is configured to acquire the voice information collected by the voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
  • the processing part is configured to recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to call;
  • the processing part is further configured to determine a first target role corresponding to the target skill from a preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
  • the control part is configured to control the first target role to perform voice broadcast for the target skill.
  • a voice information processing device including: a processor and a memory configured to store a computer program that can run on the processor,
  • the processor is configured to execute the steps of any one of the aforementioned methods when running the computer program.
  • a computer-readable storage medium is provided, and a computer program is stored thereon, and when the computer program is executed by a processor, the steps of the method described in any one of the foregoing are implemented.
  • The voice information processing method, device, equipment, and storage medium acquire the voice information collected by the voice collection unit, wherein the voice information includes first voice information used to indicate the target skill to call; recognize, based on a preset skill recognition strategy, the target skill indicated by the first voice information; determine the first target role that realizes the target skill; and control the first target role to perform the voice broadcast for the target skill. The user's intention expressed in the voice information is determined through intention judgment, and the target skill that the user wants to call is determined according to that intention, so as to wake up the target role corresponding to the target skill.
  • In this way, role wake-up can be realized more smoothly, the intelligence of voice control is improved, and multiple roles can be configured to perform voice broadcast of different skills, which makes voice control more interesting.
  • FIG. 1 is a schematic diagram of a first flow of a voice information processing method in an embodiment of the application;
  • FIG. 2 is a schematic diagram of a second flow of the voice information processing method in an embodiment of the application;
  • FIG. 3 is a schematic diagram of a third flow of the voice information processing method in an embodiment of the application;
  • FIG. 4 is a schematic diagram of the composition structure of a voice processing system in an embodiment of the application;
  • FIG. 5 is a schematic diagram of the composition structure of a skill processing system in an embodiment of the application;
  • FIG. 6 is a schematic diagram of the composition structure of a voice information processing device in an embodiment of the application;
  • FIG. 7 is a schematic diagram of the composition structure of a voice information processing device in an embodiment of the application.
  • FIG. 1 is a schematic diagram of the first flow of the voice information processing method in the embodiment of the present application. As shown in FIG. 1, the method may specifically include:
  • Step 101 Acquire voice information collected by a voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
  • Step 102 Recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to call;
  • Step 103 Determine the first target role corresponding to the target skill from the preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
  • Step 104 Control the first target character to perform voice broadcast for the target skill.
  • the execution subject of step 101 to step 104 may be the processor of the voice information processing device.
  • the voice information processing device may be located on the server side or the terminal side.
  • the terminal can be a mobile terminal or a fixed terminal with a voice control function.
  • For example, the terminal may be a smartphone, a personal computer (such as a tablet, desktop computer, notebook, netbook, or handheld computer), a mobile phone, an e-book reader, a portable multimedia player, an audio/video player, a camera, a virtual reality device, a wearable device, etc.
  • the voice collection device may be a microphone.
  • For example, a microphone located on a terminal collects the voice information, and the steps of the above voice information processing method are executed locally on the terminal; alternatively, the voice information is uploaded to a server, the server executes the steps of the method and delivers the processing result to the terminal, and the terminal executes the corresponding voice output control operation according to the processing result.
  • The preset skill recognition strategy is used for skill recognition, that is, to determine the target skill by which the user intends to control the terminal or other electronic equipment through the voice information.
  • the target skill is the skill determined according to the user's intention expressed by the first voice information.
  • the first voice message is "What's the weather today?"
  • The user's intention is to query the weather; the corresponding target skill is then determined to be "weather query", and the query date is "today".
  • The user does not need to say a control keyword such as "check the weather", but uses more colloquial language to realize the voice control function; the control process is more intelligent and conforms to the user's daily communication habits.
  • The recognizing the first voice information based on the preset skill recognition strategy includes: using voice recognition technology to perform text recognition on the first voice information to obtain first text information; and using semantic recognition technology to perform semantic recognition on the first text information to obtain the target skill that the first voice information indicates to call.
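As an illustration only (not part of the claimed method), the two-stage recognition described above, text recognition followed by semantic recognition, can be sketched as follows; the ASR step is assumed to have already produced the first text information, and the keyword rules are invented for the example:

```python
# Illustrative sketch of the preset skill recognition strategy.
# A real system would use an ASR engine plus a trained skill classifier; here
# the first text information is assumed given, and the rules are made up.

SKILL_RULES = {
    "weather query": ["weather", "temperature", "rain"],
    "music playback": ["play", "song", "music"],
    "information query": ["news", "headline"],
}

def recognize_skill(first_text_info):
    """Semantic step: map the recognized text to the target skill it indicates."""
    text = first_text_info.lower()
    for skill, keywords in SKILL_RULES.items():
        if any(kw in text for kw in keywords):
            return skill
    return None  # no skill matched; the assistant may ask the user to rephrase
```

For the utterance "What's the weather today?" from the example, this rule set would return "weather query".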
  • The method further includes: acquiring at least one skill that can be realized by each of at least three roles; and establishing the first mapping relationship by using the mapping relationships between the at least three roles and the at least one skill.
  • At least one skill that can be achieved by at least three roles is acquired; at least one skill that can be achieved by each role is used to establish a skill set; and the mapping relationship between at least three roles and the skill set is used to establish a first mapping relationship.
  • In the first mapping relationship, one role corresponds to at least one skill, and all the skills corresponding to one role form a skill set.
  • the first mapping relationship may include the mapping relationship between roles and skills, or the mapping relationship between roles and skill sets.
  • the first target role corresponding to the target skill may be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set may be determined from the first mapping relationship.
  • the voice role configured on the terminal can be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer.
  • The third-party role can be called by downloading a third-party application; alternatively, without downloading a third-party application, the program can call the third-party role through online access.
  • For example, role A corresponds to skill set A, and the skills included in set A include "weather query, radio broadcasting, audio e-book broadcasting, etc.";
  • role B corresponds to skill set B, and the skills included in set B include "music playback, music video playback, music anchor live broadcast, etc.";
  • role C corresponds to skill set C, and the skills included in set C include "information query, information recommendation, information download, etc.".
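The role-to-skill-set mapping in the example above can be sketched as a simple lookup table; the role and skill names approximately follow the example, but the data structure itself is only an assumption for illustration:

```python
# Illustrative first mapping relationship: each role maps to a skill set.
# Names approximately follow the example in the text; the structure is assumed.

FIRST_MAPPING = {
    "role A": {"weather query", "radio broadcasting", "audio e-book broadcasting"},
    "role B": {"music playback", "music video playback", "music anchor live broadcast"},
    "role C": {"information query", "information recommendation", "information download"},
}

def first_target_role(target_skill):
    """Determine the first target role whose skill set contains the target skill."""
    for role, skill_set in FIRST_MAPPING.items():
        if target_skill in skill_set:
            return role
    return None
```

This implements the second matching variant described below: the skill set containing the target skill is found first, and the role corresponding to that skill set is then returned.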
  • For example, the above role A can be a role developed by the terminal manufacturer to realize the voice control operation of its own application A, while role B and role C can be roles developed by other manufacturers to realize the voice control operations of application B and application C.
  • multiple roles are configured to perform voice broadcasts of different skills, which improves the interest of voice control.
  • The first target role corresponding to the target skill is determined from the preset first mapping relationship, wherein the first mapping relationship includes mapping relationships between at least three roles and skills. Specifically, the target skill is matched against the skills in the first mapping relationship, and the corresponding first target role is determined when the matching succeeds; or the target skill is matched against the skill sets corresponding to the roles, the skill set containing the target skill is determined, and the first target role corresponding to that skill set is then determined.
  • In practice, controlling the first target role to perform voice broadcast for the target skill includes: acquiring the timbre information and voice text information of the first target role, wherein different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling the voice output unit to output the voice audio information.
  • That is, the voice text to be output is synthesized with the timbre of the first target role to produce voice audio information with that timbre, which is output through a voice output unit such as a speaker or earphone.
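A minimal sketch of this broadcast step, assuming a hypothetical `synthesize_tts()` stand-in for a real TTS engine and invented timbre labels:

```python
# Sketch of the broadcast step: look up the first target role's timbre and
# synthesize voice audio from it plus the voice text. synthesize_tts() is a
# stand-in for a real TTS engine, and the timbre labels are invented.

TIMBRE_INFO = {"role A": "female-soft", "role B": "male-deep", "role C": "child-bright"}

def synthesize_tts(timbre, text):
    # Placeholder for a real TTS call; returns tagged bytes instead of audio.
    return ("[" + timbre + "] " + text).encode("utf-8")

def broadcast(first_target_role, voice_text):
    timbre = TIMBRE_INFO[first_target_role]     # different roles, different timbres
    audio = synthesize_tts(timbre, voice_text)  # synthesize the voice audio information
    return audio                                # would be sent to the voice output unit
```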
  • The voice information processing method, device, equipment, and storage medium acquire the voice information collected by the voice collection unit, wherein the voice information includes first voice information used to indicate the target skill to call; recognize, based on a preset skill recognition strategy, the target skill indicated by the first voice information; determine the first target role that realizes the target skill; and control the first target role to perform the voice broadcast for the target skill. The user's intention expressed in the voice information is determined through intention judgment, and the target skill that the user wants to call is determined according to that intention, so as to wake up the target role corresponding to the target skill. In this way, role wake-up can be realized more smoothly, the intelligence of voice control is improved, and multiple roles can be configured to perform voice broadcast of different skills, which makes voice control more interesting.
  • FIG. 2 is a schematic diagram of the second flow of the method for processing voice information in an embodiment of this application. As shown in FIG. 2, the method includes:
  • Step 201 Acquire voice information collected by a voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
  • the voice collection device may be a microphone.
  • For example, a microphone located on a terminal collects the voice information, and the steps of the above voice information processing method are executed locally on the terminal; alternatively, the voice information is uploaded to a server, the server executes the steps of the method and delivers the processing result to the terminal, and the terminal executes the corresponding voice output control operation according to the processing result.
  • Step 202 Recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to call;
  • The preset skill recognition strategy is used for skill recognition, that is, to determine the target skill by which the user intends to control the terminal or other electronic equipment through the voice information.
  • the target skill is the skill determined according to the user's intention expressed by the first voice information.
  • the first voice message is "What's the weather today?"
  • The user's intention is to query the weather; the corresponding target skill is then determined to be "weather query", and the query date is "today".
  • The user does not need to say a control keyword such as "check the weather", but uses more colloquial language to realize the voice control function; the control process is more intelligent and conforms to the user's daily communication habits.
  • The recognizing the first voice information based on the preset skill recognition strategy includes: using voice recognition technology to perform text recognition on the first voice information to obtain first text information; and using semantic recognition technology to perform semantic recognition on the first text information to obtain the target skill that the first voice information indicates to call.
  • Step 203 Determine the first target role corresponding to the target skill from the preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
  • the first mapping relationship may include a mapping relationship between a role and a skill, or a mapping relationship between a role and a skill set.
  • the first target role corresponding to the target skill can be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set is determined from the first mapping relationship.
  • Specifically, step 203 may include: matching the target skill against the skills in the first mapping relationship, and determining the corresponding first target role when the matching succeeds; or matching the target skill against the skill sets corresponding to the roles, determining the skill set that contains the target skill, and thereby determining the first target role corresponding to that skill set.
  • the voice role configured on the terminal can be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer.
  • The third-party role can be called by downloading a third-party application; alternatively, without downloading a third-party application, the program can call the third-party role through online access.
  • Step 204 Determine the second target role currently performing voice broadcast;
  • The second target role is the role that is performing voice broadcast before the first target role is awakened.
  • If the terminal device is using the second target role for voice communication with the user, and it is judged from the user's speaking intention that the user needs the first target role to perform the target skill, the first target role can be summoned through the second target role.
  • In practice, determining the second target role currently performing voice broadcast includes: detecting the role that is currently performing voice broadcast. For example, the role identification bit of the currently broadcasting role is detected to determine the second target role that is performing the voice broadcast.
  • Initially, a role needs to be awakened for communication with the user: for example, the user directly awakens the first target role corresponding to the target skill, or wakes up the system default role, or wakes up a role that the user often uses.
  • For example, the role may be awakened by voice.
  • The voice information further includes second voice information, and the second voice information is used to instruct to wake up the second target role; accordingly, before determining the second target role currently performing voice broadcast, the method further includes: recognizing the second voice information from the voice information, and determining the second target role whose wake-up is indicated by the second voice information; and controlling the second target role to perform voice broadcast.
  • The determining the second target role whose wake-up is indicated by the second voice information includes: determining the wake-up identifier in the second voice information; and determining, from the preset second mapping relationship, the second target role corresponding to the wake-up identifier.
  • The role and the wake-up identifier in the second mapping relationship are in one-to-one or one-to-many mapping; that is, a role can be awakened by only one wake-up identifier, or a role can be awakened by multiple wake-up identifiers.
  • The wake-up identifiers for different roles can be uniformly specified by the manufacturer, or set by the user according to habits or preferences.
  • For example, the wake-up identifiers corresponding to role A are "A classmate", "little A hello", and "foodie little A";
  • the wake-up identifier corresponding to role B is "Fat B classmate";
  • the wake-up identifiers corresponding to role C are "Hello old C" and "hey old C".
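The second mapping relationship above, with its one-to-many mapping from roles to wake-up identifiers, might be sketched like this; the wake-up phrases follow the example, while the lookup itself is an illustrative assumption:

```python
# Illustrative second mapping relationship: one role, one or more wake-up
# identifiers. Phrases follow the example above; the lookup is assumed.

SECOND_MAPPING = {
    "role A": ["A classmate", "little A hello", "foodie little A"],
    "role B": ["Fat B classmate"],
    "role C": ["Hello old C", "hey old C"],
}

def wake_target_role(second_text_info):
    """Determine the second target role whose wake-up identifier appears in the utterance."""
    utterance = second_text_info.lower()
    for role, wake_ids in SECOND_MAPPING.items():
        if any(wake_id.lower() in utterance for wake_id in wake_ids):
            return role
    return None
```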
  • In practice, acquiring the voice information collected by the voice collection unit includes: acquiring the first voice information and the second voice information at the same time; or first acquiring the second voice information, determining the second target role that the second voice information indicates to wake up, and controlling the second target role to perform voice broadcast, and then acquiring the first voice information and determining the target skill that the first voice information indicates to call.
  • Step 205 Determine whether the second target role is the same as the first target role, if not, go to step 206; if yes, go to step 207;
  • Step 206 When the second target role is different from the first target role, switch the second target role currently performing voice broadcast to the first target role;
  • When the second target role and the first target role are different, the method further includes: generating switching prompt information; and controlling the first target role or the second target role to play the switching prompt information.
  • The switching prompt information is used to remind the user that the role switching operation will be performed or has been performed.
  • The controlling the first target role or the second target role to play the switching prompt information includes: before the switching, controlling the second target role to play the switching prompt information; or, after the switching, controlling the first target role to play the switching prompt information.
  • The switching prompt information can be played before the switching operation to remind the user that the second target role is about to summon the first target role to perform voice broadcast.
  • For example, the voice assistant of role A is awakened by the wake word of role A, and the skill classifier determines that the target skill is actually a skill of role B. A smoothing sentence can be added, and the voice of role A can be used to broadcast "Questions about XX can be asked to role B", after which the actual result is fed back in the voice of role B.
  • Alternatively, the switching prompt information may be played after the switching operation to remind the user that the voice broadcast is now performed by the first target role.
  • For example, the user uses the wake word of role A to wake up the voice assistant of role A, and the skill classifier determines that the target skill is actually a skill of role B; after the switch, role B plays the switching prompt information and then broadcasts the result.
  • In this way, closed-loop dialogues are allowed between different roles, making role switching smoother and more in line with users' daily communication habits, and enhancing the intelligence of voice control.
  • step 207 is also executed after the switching is completed.
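Steps 204 to 207 can be sketched as a small decision function; the returned strings merely stand in for the actual prompt playback and voice broadcast, and all names are illustrative:

```python
# Sketch of steps 204-207: compare the role currently broadcasting with the
# first target role; if they differ, play a switching prompt, then broadcast.

def handle_target_role(current_role, first_target_role):
    actions = []
    if current_role != first_target_role:
        # switching prompt, played here in the second target role's voice
        actions.append(current_role + ": switching to " + first_target_role)
        current_role = first_target_role  # step 206: switch roles
    # step 207: the first target role performs the voice broadcast
    actions.append(current_role + ": broadcast for the target skill")
    return actions
```

When the two roles are already the same, the function skips straight to the broadcast, matching the "if yes, go to step 207" branch of step 205.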
  • Step 207 Control the first target character to perform voice broadcast for the target skill.
  • When the second target role is the same as the first target role, controlling the first target role to perform voice broadcast is equivalent to controlling the second target role to continue the voice broadcast.
  • This step may specifically include: acquiring timbre information and voice text information of the first target character; wherein different roles correspond to different timbre information; synthesizing voice and audio information based on the timbre information of the first target character and the voice text information ; Control the voice output unit to output the voice audio information.
  • That is, the voice text to be output is synthesized with the timbre of the first target role to produce voice audio information with that timbre, which is output through a voice output unit such as a speaker or earphone.
  • FIG. 3 is a schematic diagram of the third process of the method for processing voice information in an embodiment of this application. As shown in FIG. 3, the method includes:
  • Step 301 Acquire voice information collected by a voice collection unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
  • the voice collection device may be a microphone.
  • For example, a microphone located on a terminal collects the voice information, and the steps of the above voice information processing method are executed locally on the terminal; alternatively, the voice information is uploaded to a server, the server executes the steps of the method and delivers the processing result to the terminal, and the terminal executes the corresponding voice output control operation according to the processing result.
  • Step 302 Recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to call;
  • The preset skill recognition strategy is used for skill recognition, that is, to determine the target skill by which the user intends to control the terminal or other electronic equipment through the voice information.
  • the target skill is the skill determined according to the user's intention expressed by the first voice information.
  • the first voice message is "What's the weather today?"
  • The user's intention is to query the weather; the corresponding target skill is then determined to be "weather query", and the query date is "today".
  • The user does not need to say a control keyword such as "check the weather", but uses more colloquial language to realize the voice control function; the control process is more intelligent and conforms to the user's daily communication habits.
  • The recognizing the first voice information based on the preset skill recognition strategy includes: using voice recognition technology to perform text recognition on the first voice information to obtain first text information; and using semantic recognition technology to perform semantic recognition on the first text information to obtain the target skill that the first voice information indicates to call.
  • Step 303 Determine the first target role corresponding to the target skill from the preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
  • the first mapping relationship may include a mapping relationship between a role and a skill, or a mapping relationship between a role and a skill set.
  • the first target role corresponding to the target skill may be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set may be determined from the first mapping relationship.
  • Specifically, step 303 may include: matching the target skill against the skills in the first mapping relationship, and determining the corresponding first target role when the matching succeeds; or matching the target skill against the skill sets corresponding to the roles, determining the skill set that contains the target skill, and thereby determining the first target role corresponding to that skill set.
  • the voice role configured on the terminal can be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer.
  • The third-party role can be called by downloading a third-party application; alternatively, without downloading a third-party application, the program can call the third-party role through online access. Calling third-party roles expands the range of voice skills and improves the processing of the user's voice information.
  • Step 304 Control the first target role to perform voice broadcast for the target skill;
  • This step may specifically include: acquiring the timbre information and voice text information of the first target role, wherein different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling the voice output unit to output the voice audio information.
  • That is, the voice text to be output is synthesized with the timbre of the first target role to produce voice audio information with that timbre, which is output through a voice output unit such as a speaker or earphone.
  • Step 305 Acquire third voice information; wherein, the third voice information is used to instruct to quit the first target role currently performing voice broadcast;
  • Step 306 Based on the third voice information, control to exit the first target role.
  • In practice, this step specifically includes: determining the exit identifier in the third voice information; determining, from a preset third mapping relationship, the first target role corresponding to the exit identifier, wherein the third mapping relationship includes mapping relationships between at least three roles and exit identifiers; and controlling the first target role to exit.
  • the role and exit identifier in the third mapping relationship are one-to-one mapping or one-to-many mapping. That is, a role can only be exited by one exit identifier, or a role can be exited by multiple exit identifiers.
  • the exit signs for different roles can be uniformly specified by the manufacturer, or set by the user according to habits or preferences.
  • for example, the exit identifiers corresponding to role A are "Exit classmate A", "Step down, little A", and "Go away, foodie little A";
  • the exit identifier corresponding to role B is "Go away, chubby classmate B";
  • the exit identifiers corresponding to role C are "Exit old C" and "Goodbye old C".
  • here, different exit identifiers are associated with different roles, which improves the flexibility of role control.
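As a rough illustration, the third mapping relationship above can be modeled as a lookup table from roles to their exit identifiers; the role names and exit phrases below are placeholders based on the examples, not a prescribed implementation:

```python
from typing import Optional

# Hypothetical third mapping relationship: each role maps to one or more
# exit identifiers (one-to-one or one-to-many, as described above).
EXIT_IDENTIFIERS = {
    "role_a": ["exit classmate a", "step down, little a", "go away, foodie little a"],
    "role_b": ["go away, chubby classmate b"],
    "role_c": ["exit old c", "goodbye old c"],
}

def find_role_to_exit(third_voice_text: str) -> Optional[str]:
    """Return the role whose exit identifier appears in the recognized text."""
    text = third_voice_text.lower()
    for role, identifiers in EXIT_IDENTIFIERS.items():
        if any(identifier in text for identifier in identifiers):
            return role
    return None  # no exit identifier recognized; keep the current role active
```

Because the table maps each role to a list, adding or removing user-defined exit phrases only changes the data, not the lookup logic.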
  • FIG. 4 is a schematic diagram of the composition structure of the voice processing system in an embodiment of the application.
  • the voice processing system includes: a voice assistant client 401, a voice assistant central control server 402, a recognition server 403, a skill classifier 404, a role A server 405, a role B server 406, and a role C server 407.
  • the voice assistant client 401 is used to collect and upload voice information (including audio and role wake-up words), receive voice output results, and output the voice information of different roles.
  • here, the audio is the first voice information, and the role wake-up word is the second voice information.
  • the voice assistant central control server 402 through the role C server 407 are used to process the voice data.
  • the voice assistant central control server 402 is configured to receive the voice information uploaded by the voice assistant client 401, use voice recognition technology to perform text recognition on the voice information to obtain the recognized text; send the recognized text to the skill classifier 404;
  • the skill classifier 404 uses semantic recognition technology to perform semantic understanding of the text information and determine the target skill: when the target skill is skill A, the role A server 405 performs the skill A service; when the target skill is skill B, the role B server 406 performs the skill B service; and when the target skill is skill C, the role C server 407 performs the skill C service.
  • the role A server 405 processes skill A, and sends the obtained skill A intent result, skill A resource service result, response text, and role A response audio to the voice assistant central control server 402;
  • the role B server 406 processes skill B, and sends the obtained skill B intent result, skill B resource service result, response text, and role B response audio to the voice assistant central control server 402;
  • the role C server 407 processes skill C, and sends the obtained skill C intent result, skill C resource service result, response text, and role C response audio to the voice assistant central control server 402;
  • the voice assistant central control server 402 performs voice synthesis according to the received processing result to generate a voice output result, and sends the voice output result to the voice assistant client 401; the voice assistant client 401 controls the output of the voice output result.
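The dispatch flow just described can be sketched as follows; the keyword-based classifier and all names here are illustrative assumptions, since the embodiment does not fix a concrete classification algorithm:

```python
# Toy stand-in for the skill classifier 404: map keywords in the recognized
# text to a skill, then route to the role server that owns that skill.
SKILL_KEYWORDS = {"weather": "skill_a", "music": "skill_b", "news": "skill_c"}

SKILL_TO_ROLE_SERVER = {
    "skill_a": "role_a_server_405",
    "skill_b": "role_b_server_406",
    "skill_c": "role_c_server_407",
}

def classify_skill(recognized_text: str):
    """Return the target skill for the recognized text, or None."""
    text = recognized_text.lower()
    for keyword, skill in SKILL_KEYWORDS.items():
        if keyword in text:
            return skill
    return None

def dispatch(recognized_text: str) -> dict:
    """Route the recognized text to the role server owning the target skill."""
    skill = classify_skill(recognized_text)
    if skill is None:
        return {"error": "no skill recognized"}
    # In the real system this would be a request to the role server, which
    # returns the intent result, resource service result, response text,
    # and response audio described above.
    return {"target_skill": skill, "role_server": SKILL_TO_ROLE_SERVER[skill]}
```

In production the classifier would be a semantic-understanding model rather than keyword matching; the routing table, however, mirrors the role-server split in FIG. 4.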
  • FIG. 5 is a schematic diagram of the composition structure of the skill processing system in the embodiment of the application. As shown in FIG. 5, the system includes: a skill server 501, a semantic understanding server 502, a resource recall server 503, and a TTS server 504.
  • the skills server 501 sends the received recognition text to the semantic understanding server 502, and the semantic understanding server 502 performs semantic understanding on the recognized text, obtains the user's intention result, and returns the intention result to the skills server 501;
  • the skill server 501 sends the intent result to the resource recall server 503, and the resource recall server 503 determines the resource service result and response text according to the intent result and sends them to the skill server 501;
  • the skill server 501 then sends the response text to the TTS server 504; the TTS server performs speech synthesis according to the role timbre and the response text to generate response audio, and returns the response audio to the skill server 501;
  • the skill server 501 sends the obtained intent result, resource service result, response text, and response audio to the voice assistant central control server.
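The orchestration inside the skill server 501 can be sketched as a simple chain of calls; the three downstream services are modeled as placeholder functions with canned outputs, since the embodiment only specifies the call sequence and the payloads exchanged:

```python
def understand(recognized_text):
    """Stand-in for semantic understanding server 502: text -> intent result."""
    return {"intent": "weather_query", "date": "today", "query": recognized_text}

def recall_resources(intent_result):
    """Stand-in for resource recall server 503: intent -> resource + response text."""
    return {"resource": "sunny, 25C", "response_text": "Today is sunny, 25 degrees."}

def synthesize(response_text, role_timbre):
    """Stand-in for TTS server 504: response text + role timbre -> audio bytes."""
    return f"<audio:{role_timbre}:{response_text}>".encode()

def handle_recognized_text(recognized_text, role_timbre="role_a"):
    """Skill server 501: chain the three services and collect their results."""
    intent_result = understand(recognized_text)
    recall = recall_resources(intent_result)
    audio = synthesize(recall["response_text"], role_timbre)
    # These four results are what the skill server forwards to the voice
    # assistant central control server in the description above.
    return {
        "intent_result": intent_result,
        "resource_service_result": recall["resource"],
        "response_text": recall["response_text"],
        "response_audio": audio,
    }
```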
  • the voice assistant client is located on the terminal side, and the terminal side further includes a voice collection unit for collecting voice data; other servers that implement voice information processing are located on the server side.
  • part or all of the servers that implement voice information processing may also be located on the terminal side.
  • An embodiment of the present application also provides a voice information processing device. As shown in FIG. 6, the device includes:
  • the acquiring part 601 is configured to acquire voice information collected by a voice collecting unit; wherein the voice information includes first voice information, and the first voice information is used to indicate the invocation of the target skill;
  • the processing part 602 is configured to recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to call;
  • the processing part 602 is configured to determine a first target role corresponding to the target skill from a preset first mapping relationship; wherein the first mapping relationship includes mapping relationships between at least three roles and skills;
  • the control part 603 is configured to control the first target character to perform voice broadcast for the target skill.
  • before controlling the first target role to perform the voice broadcast for the target skill, the processing part is configured to determine the second target role currently performing voice broadcast;
  • the device further includes: a switching part configured to switch the second target role currently performing voice broadcast to the first target role when the second target role is different from the first target role.
  • the voice information further includes second voice information, and the second voice information is used to indicate waking up the second target role; the processing part is configured to recognize the second voice information from the voice information, determine the second target role that the second voice information indicates to wake up, and control the second target role to perform voice broadcast.
  • the processing part is configured to determine the wake-up identifier in the second voice information, and determine the second target role corresponding to the wake-up identifier from a preset second mapping relationship; wherein the second mapping relationship includes mapping relationships between at least three roles and wake-up identifiers.
  • the control part is configured to obtain the timbre information and voice text information of the first target role, where different roles correspond to different timbre information; synthesize voice audio information based on the timbre information of the first target role and the voice text information; and control the voice output unit to output the voice audio information.
  • the acquiring part is configured to acquire third voice information, where the third voice information is used to indicate exiting the first target role currently performing voice broadcast; and the control part is configured to control the exit of the first target role based on the third voice information.
  • the processing part is configured to determine the exit identifier in the third voice information, and determine the first target role corresponding to the exit identifier from a preset third mapping relationship table, where the third mapping relationship includes mapping relationships between at least three roles and exit identifiers; the control part is configured to control the exit of the first target role.
  • the embodiment of the present application also provides a voice information processing device.
  • the device includes: a processor 701 and a memory 702 configured to store a computer program that can run on the processor; when the processor 701 runs the computer program in the memory 702, the steps of the methods of the foregoing embodiments are implemented.
  • bus system 703 is used to implement connection and communication between these components.
  • in addition to a data bus, the bus system 703 also includes a power bus, a control bus, and a status signal bus; however, for clarity of illustration, the various buses are all labeled as the bus system 703 in FIG. 7.
  • An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the steps of the method described in any of the foregoing embodiments are implemented.
  • the above processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, and a microprocessor.
  • the aforementioned memory may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory, and provides instructions and data to the processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice information processing method, apparatus, device, and storage medium. The method includes: acquiring voice information collected by a voice collection unit, where the voice information includes first voice information, and the first voice information is used to indicate a target skill to be invoked (101); recognizing the first voice information based on a preset skill recognition strategy, and determining the target skill that the first voice information indicates to invoke (102); then determining a first target role that implements the target skill; and controlling the first target role to perform voice broadcast for the target skill (104). The user intent expressed in the voice information is determined through intent judgment, and the target skill the user wants to invoke is determined according to the user intent, thereby waking up the target role corresponding to the target skill. In this way, role wake-up is achieved more smoothly and the intelligence of voice control is improved, and configuring multiple roles to perform voice broadcast for different skills makes voice control more engaging.

Description

Voice information processing method, apparatus, device, and storage medium
Technical Field
This application relates to voice technology, and in particular to a voice information processing method, apparatus, device, and storage medium.
Background
Smart voice assistants have been widely used in mobile phones, vehicle-mounted terminals, smart home devices, and other products, freeing users' hands: users only need to interact with the smart voice assistant by voice to control and operate product functions.
At present, the text-to-speech (TTS) system in smart voice solutions can only provide a role with a single timbre; when the role broadcasts voice information to the user, the timbre is monotonous, and the interaction lacks interest and anthropomorphism.
Summary
To solve the problems in the related art, embodiments of this application are expected to provide a voice information processing method, apparatus, device, and storage medium.
The technical solutions of the embodiments of this application are implemented as follows:
In a first aspect, a voice information processing method is provided, including:
acquiring voice information collected by a voice collection unit, where the voice information includes first voice information, and the first voice information is used to indicate a target skill to be invoked;
recognizing the first voice information based on a preset skill recognition strategy, and determining the target skill that the first voice information indicates to invoke;
determining, from a preset first mapping relationship, a first target role corresponding to the target skill, where the first mapping relationship includes mapping relationships between at least three roles and skills; and
controlling the first target role to perform voice broadcast for the target skill.
In the above solution, before controlling the first target role to perform the voice broadcast for the target skill, the method further includes: determining a second target role currently performing voice broadcast; and when the second target role is different from the first target role, switching the second target role currently performing voice broadcast to the first target role.
In the above solution, the voice information further includes second voice information, and the second voice information is used to indicate waking up the second target role; before determining the second target role currently performing voice broadcast, the method further includes: recognizing the second voice information from the voice information, determining the second target role that the second voice information indicates to wake up, and controlling the second target role to perform voice broadcast.
In the above solution, determining the second target role that the second voice information indicates to wake up includes: determining a wake-up identifier in the second voice information; and determining, from a preset second mapping relationship, the second target role corresponding to the wake-up identifier, where the second mapping relationship includes mapping relationships between at least three roles and wake-up identifiers.
In the above solution, controlling the first target role to perform the voice broadcast for the target skill includes: acquiring timbre information and voice text information of the first target role, where different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
In the above solution, after controlling the first target role to perform the voice broadcast for the target skill, the method further includes: acquiring third voice information, where the third voice information is used to indicate exiting the first target role currently performing voice broadcast; and controlling the exit of the first target role based on the third voice information.
In the above solution, controlling the exit of the first target role based on the third voice information includes: determining an exit identifier in the third voice information; determining, from a preset third mapping relationship table, the first target role corresponding to the exit identifier, where the third mapping relationship includes mapping relationships between at least three roles and exit identifiers; and controlling the exit of the first target role.
In a second aspect, a voice information processing apparatus is provided, including:
an acquiring part, configured to acquire voice information collected by a voice collection unit, where the voice information includes first voice information, and the first voice information is used to indicate a target skill to be invoked;
a processing part, configured to recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to invoke;
the processing part, configured to determine, from a preset first mapping relationship, a first target role corresponding to the target skill, where the first mapping relationship includes mapping relationships between at least three roles and skills; and
a control part, configured to control the first target role to perform voice broadcast for the target skill.
In a third aspect, a voice information processing device is provided, including a processor and a memory configured to store a computer program that can run on the processor,
where the processor is configured to execute the steps of any one of the foregoing methods when running the computer program.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of any one of the foregoing methods are implemented.
According to the voice information processing method, apparatus, device, and storage medium provided by the embodiments of this application, voice information collected by a voice collection unit is acquired, where the voice information includes first voice information used to indicate a target skill to be invoked; the target skill indicated by the first voice information is recognized based on a preset skill recognition strategy; a first target role implementing the target skill is then determined; and the first target role is controlled to perform voice broadcast for the target skill. The user intent expressed in the voice information is determined through intent judgment, and the target skill the user wants to invoke is determined according to the user intent, thereby waking up the target role corresponding to the target skill. In this way, role wake-up is achieved more smoothly, the intelligence of voice control is improved, and configuring multiple roles to perform voice broadcast for different skills makes voice control more engaging.
Brief Description of the Drawings
FIG. 1 is a first schematic flowchart of a voice information processing method in an embodiment of this application;
FIG. 2 is a second schematic flowchart of a voice information processing method in an embodiment of this application;
FIG. 3 is a third schematic flowchart of a voice information processing method in an embodiment of this application;
FIG. 4 is a schematic diagram of the composition structure of a voice processing system in an embodiment of this application;
FIG. 5 is a schematic diagram of the composition structure of a skill processing system in an embodiment of this application;
FIG. 6 is a schematic diagram of the composition structure of a voice information processing apparatus in an embodiment of this application;
FIG. 7 is a schematic diagram of the composition structure of a voice information processing device in an embodiment of this application.
Detailed Description
To understand the features and technical content of the embodiments of this application in more detail, the implementation of the embodiments of this application is described in detail below with reference to the accompanying drawings. The attached drawings are for reference and illustration only and are not intended to limit the embodiments of this application.
An embodiment of this application provides a voice information processing method. FIG. 1 is a first schematic flowchart of the voice information processing method in an embodiment of this application. As shown in FIG. 1, the method may specifically include:
Step 101: Acquire voice information collected by a voice collection unit, where the voice information includes first voice information, and the first voice information is used to indicate a target skill to be invoked;
Step 102: Recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to invoke;
Step 103: Determine, from a preset first mapping relationship, a first target role corresponding to the target skill, where the first mapping relationship includes mapping relationships between at least three roles and skills;
Step 104: Control the first target role to perform voice broadcast for the target skill.
Here, steps 101 to 104 may be executed by a processor of a voice information processing apparatus. The voice information processing apparatus may be located on the server side or the terminal side. The terminal may be a mobile terminal or a fixed terminal with a voice control function, for example, a smartphone, a personal computer (such as a tablet, desktop, notebook, netbook, or palmtop computer), a mobile phone, an e-book reader, a portable multimedia player, an audio/video player, a camera, a virtual reality device, or a wearable device.
Here, the voice collection unit may be a microphone. For example, a microphone on the terminal collects voice information, and the steps of the above voice information processing method are executed locally on the terminal; or the voice information is uploaded to a server, the server executes the steps of the above voice information processing method and delivers the processing result to the terminal, and the terminal performs the corresponding voice output control operation according to the processing result.
In practical applications, the preset skill recognition strategy is used for skill recognition, that is, determining the target skill that the user wants the terminal or another electronic device to implement through the voice information. Here, the target skill is determined according to the user intent expressed by the first voice information; by recognizing the user intent, the target skill that the user next wants the terminal to implement is determined. For example, if the first voice information is "What's the weather like today?", the user intent recognized from the first voice information is to query the weather, so the corresponding target skill is determined to be "weather query" and the query date is "today". In this way, the user does not need to say a control keyword such as "query the weather", but can use more colloquial language to implement the voice control function; the control process is more intelligent and conforms to users' everyday communication habits.
Specifically, recognizing the first voice information based on the preset skill recognition strategy includes: performing text recognition on the first voice information using voice recognition technology to obtain first text information; and performing semantic recognition on the first text information using semantic recognition technology to obtain the target skill that the first voice information indicates to invoke.
That is, the text information contained in the voice information is recognized using voice recognition technology, and the target skill indicated by the text information is recognized using semantic recognition technology.
In some embodiments, the method further includes: acquiring at least one skill that each of at least three roles can implement; and establishing the first mapping relationship using the mapping relationships between the at least three roles and the at least one skill.
Alternatively: acquiring at least one skill that each of at least three roles can implement; establishing a skill set from the at least one skill that each role can implement; and establishing the first mapping relationship using the mapping relationships between the at least three roles and the skill sets. In the first mapping relationship, one role corresponds to at least one skill, and all the skills corresponding to one role form a skill set.
That is, the first mapping relationship may include mapping relationships between roles and skills, or between roles and skill sets. The first target role corresponding to the target skill may be determined directly from the first mapping relationship; or the skill set containing the target skill is determined first, and then the first target role corresponding to the skill set is determined from the first mapping relationship.
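The two lookup forms of the first mapping relationship described above can be sketched as follows; the role and skill names are illustrative placeholders:

```python
# Hypothetical role -> skill-set form of the first mapping relationship.
ROLE_SKILL_SETS = {
    "role_a": {"weather_query", "radio_playback", "audiobook_playback"},
    "role_b": {"music_playback", "music_video_playback", "music_livestream"},
    "role_c": {"info_query", "info_recommendation", "info_download"},
}

# The direct skill -> role form can be derived by flattening the skill sets.
SKILL_TO_ROLE = {skill: role
                 for role, skills in ROLE_SKILL_SETS.items()
                 for skill in skills}

def role_for_skill(target_skill):
    """Direct lookup in the flattened first mapping relationship."""
    return SKILL_TO_ROLE.get(target_skill)

def role_for_skill_via_sets(target_skill):
    """Find the skill set containing the target skill, then return its role."""
    for role, skills in ROLE_SKILL_SETS.items():
        if target_skill in skills:
            return role
    return None
```

Both functions return the same role when each skill belongs to exactly one role; the set-based form is convenient when skills are administered per role.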
In practical applications, the skills corresponding to different roles may be the same or different; that is, different roles may implement the same or different skills. Here, the voice role configured on the terminal may be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer; the third-party role may be invoked by downloading a third-party application, or invoked through online access without downloading a third-party application.
For example, role A corresponds to skill set A, which includes skills such as "weather query, radio playback, audiobook playback";
role B corresponds to skill set B, which includes skills such as "music playback, music video playback, music livestreaming";
role C corresponds to skill set C, which includes skills such as "information query, information recommendation, information download".
The above role A may be a role developed by the terminal manufacturer itself to implement voice control operations for its own application A, while role B and role C may be roles developed by others to implement voice control operations for application B and application C.
In the embodiments of this application, configuring multiple roles to perform voice broadcast for different skills makes voice control more engaging.
Further, the first target role corresponding to the target skill is determined from the preset first mapping relationship, where the first mapping relationship includes mapping relationships between at least three roles and skills. Specifically, the target skill is matched against the skills in the first mapping relationship, and the first target role corresponding to a successful match is determined; or the target skill is matched against the skill sets corresponding to the roles to determine the skill set containing the target skill, thereby determining the first target role corresponding to that skill set.
Further, controlling the first target role to perform the voice broadcast for the target skill includes: acquiring timbre information and voice text information of the first target role, where different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
That is, when voice information needs to be output while executing the target skill to implement voice interaction with the user, the voice text to be output is synthesized with the timbre of the first target role to produce voice audio information carrying the timbre of the first target role, which is output through a voice output unit such as a speaker or earphones.
According to the voice information processing method, apparatus, device, and storage medium provided by the embodiments of this application, voice information collected by a voice collection unit is acquired, where the voice information includes first voice information used to indicate a target skill to be invoked; the target skill indicated by the first voice information is recognized based on a preset skill recognition strategy; a first target role implementing the target skill is then determined; and the first target role is controlled to perform voice broadcast for the target skill. The user intent expressed in the voice information is determined through intent judgment, and the target skill the user wants to invoke is determined according to the user intent, thereby waking up the target role corresponding to the target skill. In this way, role wake-up is achieved more smoothly, the intelligence of voice control is improved, and configuring multiple roles to perform voice broadcast for different skills makes voice control more engaging.
On the basis of the above embodiments, a more detailed voice information processing method is further provided. FIG. 2 is a second schematic flowchart of the voice information processing method in an embodiment of this application. As shown in FIG. 2, the method includes:
Step 201: Acquire voice information collected by a voice collection unit, where the voice information includes first voice information, and the first voice information is used to indicate a target skill to be invoked;
Here, the voice collection unit may be a microphone. For example, a microphone on the terminal collects voice information, and the steps of the above voice information processing method are executed locally on the terminal; or the voice information is uploaded to a server, the server executes the steps of the above voice information processing method and delivers the processing result to the terminal, and the terminal performs the corresponding voice output control operation according to the processing result.
Step 202: Recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to invoke;
In practical applications, the preset skill recognition strategy is used for skill recognition, that is, determining the target skill that the user wants the terminal or another electronic device to implement through the voice information. Here, the target skill is determined according to the user intent expressed by the first voice information; by recognizing the user intent, the target skill that the user next wants the terminal to implement is determined. For example, if the first voice information is "What's the weather like today?", the user intent recognized from the first voice information is to query the weather, so the corresponding target skill is determined to be "weather query" and the query date is "today". In this way, the user does not need to say a control keyword such as "query the weather", but can use more colloquial language to implement the voice control function; the control process is more intelligent and conforms to users' everyday communication habits.
Specifically, recognizing the first voice information based on the preset skill recognition strategy includes: performing text recognition on the first voice information using voice recognition technology to obtain first text information; and performing semantic recognition on the first text information using semantic recognition technology to obtain the target skill that the first voice information indicates to invoke.
Step 203: Determine, from a preset first mapping relationship, a first target role corresponding to the target skill, where the first mapping relationship includes mapping relationships between at least three roles and skills;
Here, the first mapping relationship may include mapping relationships between roles and skills, or between roles and skill sets. The first target role corresponding to the target skill may be determined directly from the first mapping relationship; or the skill set containing the target skill is determined first, and then the first target role corresponding to the skill set is determined from the first mapping relationship.
Accordingly, step 203 may specifically include: matching the target skill against the skills in the first mapping relationship, and determining the first target role corresponding to a successful match; or matching the target skill against the skill sets corresponding to the roles, determining the skill set containing the target skill, and thereby determining the first target role corresponding to that skill set.
In practical applications, the skills corresponding to different roles may be the same or different; that is, different roles may implement the same or different skills. Here, the voice role configured on the terminal may be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer; the third-party role may be invoked by downloading a third-party application, or invoked through online access without downloading a third-party application.
Step 204: Determine a second target role currently performing voice broadcast;
Here, the second target role is the role that was performing voice broadcast before the first target role is woken up. For example, while the terminal device is using the second target role to communicate with the user by voice, if it is judged from the user's speaking intent that the user needs the first target role to execute the target skill, the first target role can be summoned through the second target role.
In some embodiments, determining the second target role currently performing voice broadcast includes: detecting the role currently performing voice broadcast to determine the second target role. For example, the role identifier flag of the role currently performing voice broadcast is detected to determine the second target role that is performing voice broadcast.
In some embodiments, before the first target role that executes the target skill is determined, a role needs to be woken up first for initial communication with the user. For example, the user directly wakes up the first target role corresponding to the target skill to be invoked; or wakes up the system default role; or wakes up a role that the user frequently uses.
In some embodiments, the wake-up can be performed by voice. For example, the voice information further includes second voice information, and the second voice information is used to indicate waking up the second target role. Accordingly, before determining the second target role currently performing voice broadcast, the method further includes: recognizing the second voice information from the voice information, determining the second target role that the second voice information indicates to wake up, and controlling the second target role to perform voice broadcast.
Further, determining the second target role that the second voice information indicates to wake up includes: determining a wake-up identifier in the second voice information; and determining, from a preset second mapping relationship, the second target role corresponding to the wake-up identifier, where the second mapping relationship includes mapping relationships between at least three roles and wake-up identifiers.
In practical applications, the mapping between roles and wake-up identifiers in the second mapping relationship is one-to-one or one-to-many; that is, a role can be woken up by only one wake-up identifier, or a role can be woken up by multiple wake-up identifiers. The wake-up identifiers of different roles may be uniformly specified by the manufacturer, or set by the user according to habits or preferences.
For example, the wake-up identifiers corresponding to role A are "Classmate A", "Hello little A", and "Foodie little A";
the wake-up identifier corresponding to role B is "Chubby classmate B";
the wake-up identifiers corresponding to role C are "Hello old C" and "Hi old C".
Here, associating different wake-up identifiers with different roles improves the flexibility of role control.
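A minimal sketch of wake-up identifier matching follows, assuming the wake-up phrase (second voice information) comes at the start of the utterance and the remainder is the first voice information; the phrases are the illustrative ones above, not a fixed vocabulary:

```python
# Hypothetical second mapping relationship: roles to their wake-up identifiers
# (one-to-one or one-to-many, as described above).
WAKE_IDENTIFIERS = {
    "role_a": ["classmate a", "hello little a", "foodie little a"],
    "role_b": ["chubby classmate b"],
    "role_c": ["hello old c", "hi old c"],
}

def parse_utterance(text):
    """Return (woken role, remaining command text), or (None, text) if no
    wake-up identifier is found at the start of the utterance."""
    lowered = text.lower().strip()
    for role, phrases in WAKE_IDENTIFIERS.items():
        for phrase in phrases:
            if lowered.startswith(phrase):
                rest = lowered[len(phrase):].lstrip(" ,")
                return role, rest
    return None, lowered
```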
In real-time applications, acquiring the voice information collected by the voice collection unit includes: acquiring the first voice information and the second voice information at the same time; or first acquiring the second voice information, determining the second target role that the second voice information indicates to wake up, and controlling the second target role to perform voice broadcast, and then acquiring the first voice information and determining the target skill that the first voice information indicates to invoke.
Step 205: Judge whether the second target role and the first target role are the same; if not, execute step 206; if yes, execute step 207;
Step 206: When the second target role is different from the first target role, switch the second target role currently performing voice broadcast to the first target role;
In some embodiments, when the second target role is different from the first target role, the method further includes: generating switching prompt information; and controlling the first target role or the second target role to play the switching prompt information. Here, the switching prompt information is used to prompt the user that a role switching operation is about to be performed, or has already been performed.
Controlling the first target role or the second target role to play the switching prompt information includes: before the switch, controlling the second target role to play the switching prompt information; or after the switch, controlling the first target role to play the switching prompt information.
That is, the switching prompt information may be played before the switching operation to remind the user that the second target role is about to summon the first target role to perform voice broadcast. For example, the voice assistant is woken up with role A's wake-up word, and the skill classifier actually judges that the target skill belongs to role B; a smoothing sentence can be added and broadcast in role A's timbre, such as "You can ask role B about XX!", before the actual result from role B is returned.
Alternatively, the switching prompt information may be played after the switching operation to remind the user that the first target role is now performing voice broadcast. For example, the voice assistant is woken up with role A's wake-up word, and the skill classifier actually judges that the target skill belongs to role B; a smoothing sentence can be added and broadcast in role B's timbre, such as "Role A doesn't know about XX, so let me, role B, answer you", before the actual result from role B is returned.
Here, closed-loop dialogue between different roles is allowed, which makes role switching smoother, better conforms to users' everyday communication habits, and improves the intelligence of voice control.
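The switching logic of steps 204 to 207, including the smoothing prompt, might look like the following sketch; the prompt wording and role names are illustrative placeholders, not text prescribed by the embodiment:

```python
def switch_role(current_role, target_role, topic, announce_before=True):
    """Return the (speaking_role, line) pairs produced by a role switch."""
    lines = []
    if current_role == target_role:
        return lines  # roles match: proceed directly to step 207, no switch
    if announce_before:
        # Before switching, the outgoing (second target) role announces the handover.
        lines.append((current_role,
                      f"You can ask {target_role} about {topic}!"))
    else:
        # After switching, the incoming (first target) role introduces itself.
        lines.append((target_role,
                      f"{current_role} doesn't know about {topic}, "
                      f"so let me, {target_role}, answer you."))
    return lines
```

Returning the speaking role alongside each line reflects that the prompt is synthesized in that role's timbre before the target skill's actual result is broadcast.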
Here, after the switch is completed, step 207 is likewise executed.
Step 207: Control the first target role to perform voice broadcast for the target skill.
It can be understood that when the second target role is the same as the first target role, the first target role is the second target role, and controlling the first target role to perform voice broadcast is controlling the second target role to perform voice broadcast.
This step may specifically include: acquiring timbre information and voice text information of the first target role, where different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
That is, when voice information needs to be output while executing the target skill to implement voice interaction with the user, the voice text to be output is synthesized with the timbre of the first target role to produce voice audio information carrying the timbre of the first target role, which is output through a voice output unit such as a speaker or earphones.
On the basis of the above embodiments, a more detailed voice information processing method is further provided. FIG. 3 is a third schematic flowchart of the voice information processing method in an embodiment of this application. As shown in FIG. 3, the method includes:
Step 301: Acquire voice information collected by a voice collection unit, where the voice information includes first voice information, and the first voice information is used to indicate a target skill to be invoked;
Here, the voice collection unit may be a microphone. For example, a microphone on the terminal collects voice information, and the steps of the above voice information processing method are executed locally on the terminal; or the voice information is uploaded to a server, the server executes the steps of the above voice information processing method and delivers the processing result to the terminal, and the terminal performs the corresponding voice output control operation according to the processing result.
Step 302: Recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to invoke;
In practical applications, the preset skill recognition strategy is used for skill recognition, that is, determining the target skill that the user wants the terminal or another electronic device to implement through the voice information. Here, the target skill is determined according to the user intent expressed by the first voice information; by recognizing the user intent, the target skill that the user next wants the terminal to implement is determined. For example, if the first voice information is "What's the weather like today?", the user intent recognized from the first voice information is to query the weather, so the corresponding target skill is determined to be "weather query" and the query date is "today". In this way, the user does not need to say a control keyword such as "query the weather", but can use more colloquial language to implement the voice control function; the control process is more intelligent and conforms to users' everyday communication habits.
Specifically, recognizing the first voice information based on the preset skill recognition strategy includes: performing text recognition on the first voice information using voice recognition technology to obtain first text information; and performing semantic recognition on the first text information using semantic recognition technology to obtain the target skill that the first voice information indicates to invoke.
Step 303: Determine, from a preset first mapping relationship, a first target role corresponding to the target skill, where the first mapping relationship includes mapping relationships between at least three roles and skills;
Here, the first mapping relationship may include mapping relationships between roles and skills, or between roles and skill sets. The first target role corresponding to the target skill may be determined directly from the first mapping relationship; or the skill set containing the target skill is determined first, and then the first target role corresponding to the skill set is determined from the first mapping relationship.
Accordingly, step 303 may specifically include: matching the target skill against the skills in the first mapping relationship, and determining the first target role corresponding to a successful match; or matching the target skill against the skill sets corresponding to the roles, determining the skill set containing the target skill, and thereby determining the first target role corresponding to that skill set.
In practical applications, the skills corresponding to different roles may be the same or different; that is, different roles may implement the same or different skills. Here, the voice role configured on the terminal may be a role developed by the terminal manufacturer, or a third-party role developed by a third-party manufacturer; the third-party role may be invoked by downloading a third-party application, or invoked through online access without downloading a third-party application. By invoking third-party roles, the range of voice skills is expanded and the processing of the user's voice information is improved.
Step 304: Control the first target role to perform voice broadcast for the target skill;
This step may specifically include: acquiring timbre information and voice text information of the first target role, where different roles correspond to different timbre information; synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
That is, when voice information needs to be output while executing the target skill to implement voice interaction with the user, the voice text to be output is synthesized with the timbre of the first target role to produce voice audio information carrying the timbre of the first target role, which is output through a voice output unit such as a speaker or earphones.
Step 305: Acquire third voice information, where the third voice information is used to indicate exiting the first target role currently performing voice broadcast;
Step 306: Control the exit of the first target role based on the third voice information.
In some embodiments, this step specifically includes: determining an exit identifier in the third voice information; determining, from a preset third mapping relationship table, the first target role corresponding to the exit identifier, where the third mapping relationship includes mapping relationships between at least three roles and exit identifiers; and controlling the exit of the first target role.
In practical applications, the mapping between roles and exit identifiers in the third mapping relationship is one-to-one or one-to-many; that is, a role can be exited by only one exit identifier, or a role can be exited by multiple exit identifiers. The exit identifiers of different roles may be uniformly specified by the manufacturer, or set by the user according to habits or preferences.
For example, the exit identifiers corresponding to role A are "Exit classmate A", "Step down, little A", and "Go away, foodie little A";
the exit identifier corresponding to role B is "Go away, chubby classmate B";
the exit identifiers corresponding to role C are "Exit old C" and "Goodbye old C".
Here, associating different exit identifiers with different roles improves the flexibility of role control.
On the basis of the above embodiments, a voice information processing scenario is further provided. FIG. 4 is a schematic diagram of the composition structure of the voice processing system in an embodiment of this application. As shown in FIG. 4, the voice processing system includes: a voice assistant client 401, a voice assistant central control server 402, a recognition server 403, a skill classifier 404, a role A server 405, a role B server 406, and a role C server 407.
Here, the voice assistant client 401 is used to collect and upload voice information (including audio and role wake-up words), receive voice output results, and output the voice information of different roles. Here, the audio is the first voice information, and the role wake-up word is the second voice information.
The voice assistant central control server 402 through the role C server 407 are used to process the voice data.
The voice assistant central control server 402 is configured to receive the voice information uploaded by the voice assistant client 401, perform text recognition on the voice information using voice recognition technology to obtain recognized text, and send the recognized text to the skill classifier 404;
the skill classifier 404 performs semantic understanding on the text information using semantic recognition technology to determine the target skill: when the target skill is skill A, the role A server 405 performs the skill A service; when the target skill is skill B, the role B server 406 performs the skill B service; and when the target skill is skill C, the role C server 407 performs the skill C service.
The role A server 405 processes skill A, and sends the obtained skill A intent result, skill A resource service result, response text, and role A response audio to the voice assistant central control server 402;
the role B server 406 processes skill B, and sends the obtained skill B intent result, skill B resource service result, response text, and role B response audio to the voice assistant central control server 402;
the role C server 407 processes skill C, and sends the obtained skill C intent result, skill C resource service result, response text, and role C response audio to the voice assistant central control server 402.
The voice assistant central control server 402 performs voice synthesis according to the received processing results to generate a voice output result and sends the voice output result to the voice assistant client 401; the voice assistant client 401 controls the output of the voice output result.
FIG. 5 is a schematic diagram of the composition structure of the skill processing system in an embodiment of this application. As shown in FIG. 5, the system includes: a skill server 501, a semantic understanding server 502, a resource recall server 503, and a TTS server 504.
The skill server 501 sends the received recognized text to the semantic understanding server 502; the semantic understanding server 502 performs semantic understanding on the recognized text, obtains the user's intent result, and returns the intent result to the skill server 501;
the skill server 501 sends the intent result to the resource recall server 503, and the resource recall server 503 determines the resource service result and the response text according to the intent result and sends them to the skill server 501;
the skill server 501 then sends the response text to the TTS server 504; the TTS server performs speech synthesis according to the role timbre and the response text to generate response audio, and returns the response audio to the skill server 501;
the skill server 501 sends the obtained intent result, resource service result, response text, and response audio, as the voice processing results, to the voice assistant central control server.
In one implementation scenario, the voice assistant client is located on the terminal side, and the terminal side further includes a voice collection unit for collecting voice data; the other servers implementing voice information processing are located on the server side.
In other implementation scenarios, some or all of the servers implementing voice information processing may also be located on the terminal side.
An embodiment of this application further provides a voice information processing apparatus. As shown in FIG. 6, the apparatus includes:
an acquiring part 601, configured to acquire voice information collected by a voice collection unit, where the voice information includes first voice information, and the first voice information is used to indicate a target skill to be invoked;
a processing part 602, configured to recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to invoke;
the processing part 602, configured to determine, from a preset first mapping relationship, a first target role corresponding to the target skill, where the first mapping relationship includes mapping relationships between at least three roles and skills; and
a control part 603, configured to control the first target role to perform voice broadcast for the target skill.
In some embodiments, before controlling the first target role to perform the voice broadcast for the target skill, the processing part is configured to determine a second target role currently performing voice broadcast;
accordingly, the apparatus further includes: a switching part, configured to switch the second target role currently performing voice broadcast to the first target role when the second target role is different from the first target role.
In some embodiments, the voice information further includes second voice information, and the second voice information is used to indicate waking up the second target role; the processing part is configured to recognize the second voice information from the voice information, determine the second target role that the second voice information indicates to wake up, and control the second target role to perform voice broadcast.
In some embodiments, the processing part is configured to determine a wake-up identifier in the second voice information, and determine, from a preset second mapping relationship, the second target role corresponding to the wake-up identifier, where the second mapping relationship includes mapping relationships between at least three roles and wake-up identifiers.
In some embodiments, the control part is configured to acquire timbre information and voice text information of the first target role, where different roles correspond to different timbre information; synthesize voice audio information based on the timbre information of the first target role and the voice text information; and control a voice output unit to output the voice audio information.
In some embodiments, the acquiring part is configured to acquire third voice information, where the third voice information is used to indicate exiting the first target role currently performing voice broadcast; and the control part is configured to control the exit of the first target role based on the third voice information.
In some embodiments, the processing part is configured to determine an exit identifier in the third voice information, and determine, from a preset third mapping relationship table, the first target role corresponding to the exit identifier, where the third mapping relationship includes mapping relationships between at least three roles and exit identifiers; the control part is configured to control the exit of the first target role.
An embodiment of this application further provides a voice information processing device. As shown in FIG. 7, the device includes: a processor 701 and a memory 702 configured to store a computer program that can run on the processor; when the processor 701 runs the computer program in the memory 702, the steps of the methods of the foregoing embodiments are implemented.
Of course, in practical applications, as shown in FIG. 7, the components in the device are coupled together through a bus system 703. It can be understood that the bus system 703 is used to implement connection and communication between these components. In addition to a data bus, the bus system 703 also includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as the bus system 703 in FIG. 7.
An embodiment of this application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method described in any of the above embodiments are implemented.
In practical applications, the above processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, and a microprocessor. It can be understood that, for different devices, other electronic components may also be used to implement the above processor functions, which is not specifically limited in the embodiments of this application.
The above memory may be a volatile memory, such as a random-access memory (RAM); or a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory, and provides instructions and data to the processor.
It should be noted that terms such as "first" and "second" are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The methods disclosed in the several method embodiments provided in this application can be combined arbitrarily without conflict to obtain new method embodiments.
The features disclosed in the several product embodiments provided in this application can be combined arbitrarily without conflict to obtain new product embodiments.
The features disclosed in the several method or device embodiments provided in this application can be combined arbitrarily without conflict to obtain new method embodiments or device embodiments.
The above are only specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed by the present invention, which should all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

  1. A voice information processing method, comprising:
    acquiring voice information collected by a voice collection unit, wherein the voice information comprises first voice information, and the first voice information is used to indicate a target skill to be invoked;
    recognizing the first voice information based on a preset skill recognition strategy, and determining the target skill that the first voice information indicates to invoke;
    determining, from a preset first mapping relationship, a first target role corresponding to the target skill, wherein the first mapping relationship comprises mapping relationships between at least three roles and skills; and
    controlling the first target role to perform voice broadcast for the target skill.
  2. The method according to claim 1, wherein before controlling the first target role to perform the voice broadcast for the target skill, the method further comprises:
    determining a second target role currently performing voice broadcast; and
    when the second target role is different from the first target role, switching the second target role currently performing voice broadcast to the first target role.
  3. The method according to claim 2, wherein the voice information further comprises second voice information, and the second voice information is used to indicate waking up the second target role;
    before determining the second target role currently performing voice broadcast, the method further comprises:
    recognizing the second voice information from the voice information, and determining the second target role that the second voice information indicates to wake up; and
    controlling the second target role to perform voice broadcast.
  4. The method according to claim 3, wherein determining the second target role that the second voice information indicates to wake up comprises:
    determining a wake-up identifier in the second voice information; and
    determining, from a preset second mapping relationship, the second target role corresponding to the wake-up identifier, wherein the second mapping relationship comprises mapping relationships between at least three roles and wake-up identifiers.
  5. The method according to claim 1, wherein controlling the first target role to perform the voice broadcast for the target skill comprises:
    acquiring timbre information and voice text information of the first target role, wherein different roles correspond to different timbre information;
    synthesizing voice audio information based on the timbre information of the first target role and the voice text information; and
    controlling a voice output unit to output the voice audio information.
  6. The method according to any one of claims 1 to 5, wherein after controlling the first target role to perform the voice broadcast for the target skill, the method further comprises:
    acquiring third voice information, wherein the third voice information is used to indicate exiting the first target role currently performing voice broadcast; and
    controlling the exit of the first target role based on the third voice information.
  7. The method according to claim 6, wherein controlling the exit of the first target role based on the third voice information comprises:
    determining an exit identifier in the third voice information;
    determining, from a preset third mapping relationship table, the first target role corresponding to the exit identifier, wherein the third mapping relationship comprises mapping relationships between at least three roles and exit identifiers; and
    controlling the exit of the first target role.
  8. A voice information processing apparatus, comprising:
    an acquiring part, configured to acquire voice information collected by a voice collection unit, wherein the voice information comprises first voice information, and the first voice information is used to indicate a target skill to be invoked;
    a processing part, configured to recognize the first voice information based on a preset skill recognition strategy, and determine the target skill that the first voice information indicates to invoke;
    the processing part being configured to determine, from a preset first mapping relationship, a first target role corresponding to the target skill, wherein the first mapping relationship comprises mapping relationships between at least three roles and skills; and
    a control part, configured to control the first target role to perform voice broadcast for the target skill.
  9. A voice information processing device, comprising a processor and a memory configured to store a computer program that can run on the processor,
    wherein the processor is configured to execute the steps of the method according to any one of claims 1 to 7 when running the computer program.
  10. A computer-readable storage medium, on which a computer program is stored, wherein when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.
PCT/CN2019/113943 2019-10-29 2019-10-29 Voice information processing method, apparatus, device, and storage medium WO2021081744A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980099978.9A 2019-10-29 2019-10-29 Voice information processing method, apparatus, device, and storage medium
PCT/CN2019/113943 2019-10-29 2019-10-29 Voice information processing method, apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/113943 WO2021081744A1 (zh) 2019-10-29 2019-10-29 语音信息处理方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021081744A1 true WO2021081744A1 (zh) 2021-05-06

Family

ID=75714767

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/113943 WO2021081744A1 (zh) Voice information processing method, apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN114391165A (zh)
WO (1) WO2021081744A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115001890A (zh) * 2022-05-31 2022-09-02 Sichuan Hongmei Intelligent Technology Co., Ltd. Response-free smart home appliance control method and apparatus
WO2022247825A1 (zh) * 2021-05-24 2022-12-01 Vivo Mobile Communication Co., Ltd. Information broadcast method and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464716A (zh) * 2014-11-20 2015-03-25 Beijing Unisound Information Technology Co., Ltd. Voice broadcast system and method
US20170083285A1 (en) * 2015-09-21 2017-03-23 Amazon Technologies, Inc. Device selection for providing a response
CN107122179A (zh) * 2017-03-31 2017-09-01 Alibaba Group Holding Limited Voice function control method and apparatus
CN108735211A (zh) * 2018-05-16 2018-11-02 Zhiche Youxing Technology (Beijing) Co., Ltd. Voice processing method and apparatus, vehicle, electronic device, program, and medium
CN109524010A (zh) * 2018-12-24 2019-03-26 Mobvoi Information Technology Co., Ltd. Voice control method, apparatus, device, and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022247825A1 (zh) * 2021-05-24 2022-12-01 Vivo Mobile Communication Co., Ltd. Information broadcast method and electronic device
CN115001890A (zh) * 2022-05-31 2022-09-02 Sichuan Hongmei Intelligent Technology Co., Ltd. Response-free smart home appliance control method and apparatus
CN115001890B (zh) * 2022-05-31 2023-10-31 Sichuan Hongmei Intelligent Technology Co., Ltd. Response-free smart home appliance control method and apparatus

Also Published As

Publication number Publication date
CN114391165A (zh) 2022-04-22

Similar Documents

Publication Publication Date Title
CN107340991B (zh) Voice role switching method, apparatus, device, and storage medium
CN107704275B (zh) Smart device wake-up method and apparatus, server, and smart device
CN110634483B (zh) Human-computer interaction method and apparatus, electronic device, and storage medium
JP6505117B2 (ja) Digital personal assistant interaction through impersonation, and rich multimedia in responses
CN109637548A (zh) Voiceprint-recognition-based voice interaction method and apparatus
CN111124123A (zh) Voice interaction method and apparatus based on a virtual robot image, and intelligent control system for in-vehicle devices
CN111261151B (zh) Voice processing method and apparatus, electronic device, and storage medium
US10249296B1 (en) Application discovery and selection in language-based systems
WO2019242414A1 (zh) Voice processing method and apparatus, storage medium, and electronic device
CN107393534B (zh) Voice interaction method and apparatus, computer apparatus, and computer-readable storage medium
US10783884B2 (en) Electronic device-awakening method and apparatus, device and computer-readable storage medium
US11511200B2 (en) Game playing method and system based on a multimedia file
US20200265843A1 (en) Speech broadcast method, device and terminal
WO2019228138A1 (zh) Music playback method and apparatus, storage medium, and electronic device
WO2021081744A1 (zh) Voice information processing method, apparatus, device, and storage medium
WO2018076664A1 (zh) Voice broadcast method and apparatus
CN112912955B (zh) Electronic apparatus and system for providing voice-recognition-based services
CN109377979B (zh) Method and system for updating welcome messages
CN112185362A (zh) Voice processing method and apparatus for user-personalized services
CN108492826B (zh) Audio processing method and apparatus, smart device, and medium
WO2021042584A1 (zh) Full-duplex voice dialogue method
EP4293664A1 (en) Voiceprint recognition method, graphical interface, and electronic device
CN112114770A (zh) Voice-interaction-based interface guidance method, apparatus, and device
CN111161734A (zh) Voice interaction method and apparatus based on designated scenarios
CN110139187B (zh) Smart speaker control method and apparatus, and terminal

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19950862

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19950862

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.10.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19950862

Country of ref document: EP

Kind code of ref document: A1