CN114420117A - Voice data processing method and device, computer equipment and storage medium - Google Patents

Voice data processing method and device, computer equipment and storage medium

Info

Publication number
CN114420117A
Authority
CN
China
Prior art keywords
information
user
target
voice
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111626554.9A
Other languages
Chinese (zh)
Inventor
Li Wenyan (李文燕)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Volkswagen Mobvoi Beijing Information Technology Co Ltd
Original Assignee
Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Volkswagen Mobvoi Beijing Information Technology Co Ltd filed Critical Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority to CN202111626554.9A
Publication of CN114420117A
Legal status: Pending

Classifications

    • G: PHYSICS
      • G10: MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00: Speech recognition
            • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L15/063: Training
                • G10L2015/0635: Training updating or merging of old and new templates; Mean values; Weighting
                • G10L2015/0638: Interactive procedures
            • G10L15/08: Speech classification or search
              • G10L15/18: Speech classification or search using natural language modelling
                • G10L15/1822: Parsing for meaning understanding
            • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
              • G10L2015/223: Execution procedure of a spoken command
          • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
              • G10L25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
                • G10L25/63: Speech or voice analysis techniques for comparison or discrimination, for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to a voice data processing method and apparatus, a computer device, and a storage medium. The method includes the following steps: in response to a voice instruction, extracting a target word from the voice instruction; generating guidance information according to the target word, where the guidance information is used to guide a user to input application-related information corresponding to the target word; receiving the application-related information of the target word input based on the guidance information; and predicting a broadcast scene adapted to the target word according to the application-related information, and associating the broadcast scene with the target word. With this method, the voice assistant database can be dynamically expanded in a way that meets the user's personalized needs.

Description

Voice data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of voice technologies, and in particular, to a method and an apparatus for processing voice data, a computer device, and a storage medium.
Background
With the development of voice technology, voice assistants have emerged: artificial intelligence applications that carry out instructions through voice interaction.
A conventional voice assistant has no self-learning capability: without a software upgrade, it can neither execute nor respond to information that does not exist in its database. The user is therefore confined to the designed skill range while using the assistant, and voice assistants of the same product and software version behave identically, giving uniform feedback to different users. While interacting with the assistant, the user can neither dynamically extend the set of voice instructions nor build a database that meets his or her personalized needs.
Disclosure of Invention
In view of the above, there is a need to provide a voice data processing method and apparatus, a computer device, and a storage medium capable of personalizing and extending the voice instruction database during use of a voice assistant.
A method of processing voice data, the method comprising:
in response to a voice instruction, extracting a target word from the voice instruction;
generating guidance information according to the target word, wherein the guidance information is used to guide a user to input application-related information corresponding to the target word;
receiving the application-related information of the target word input based on the guidance information;
and predicting a broadcast scene adapted to the target word according to the application-related information, and associating the broadcast scene with the target word.
In one embodiment, before generating the guidance information according to the target word, the method further comprises: retrieving the target word in a voice assistant database; and if the target word cannot be retrieved in the voice assistant database, proceeding to the step of generating the guidance information according to the target word.
In one embodiment, receiving the application-related information of the target word input based on the guidance information includes: receiving input data input based on the guidance information; parsing content information and user emotion information from the input data; and using the content information and the user emotion information as the application-related information of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input the meaning of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input the usage mode of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input the usage scenario of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input an applicable sentence of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input an associated skill of the target word.
In one embodiment, the method further comprises: acquiring voice interaction data; determining a current dialog scene according to the voice interaction data; and when the current dialog scene matches the broadcast scene associated with the target word, performing a voice broadcast based on the target word.
In one embodiment, the method further comprises: acquiring voice interaction data; determining a current dialog scene according to the voice interaction data; and when the current dialog scene matches the broadcast scene associated with the target word, starting the associated skill of the target word.
In one embodiment, determining the current dialog scene according to the voice interaction data includes: parsing current dialog content information and current dialog emotion information from the voice interaction data; and determining the current dialog scene according to the current dialog content information and the current dialog emotion information.
In one embodiment, performing the voice broadcast based on the target word includes: playing a preset broadcast sentence associated with the target word.
In one embodiment, performing the voice broadcast based on the target word includes: generating broadcast content according to the current dialog content in combination with the application-related information associated with the target word, and playing the broadcast content.
A voice data processing apparatus, the apparatus comprising:
a voice instruction acquisition module, configured to extract a target word from a voice instruction in response to the voice instruction;
a guidance information generation module, configured to generate guidance information according to the target word, wherein the guidance information is used to guide a user to input application-related information corresponding to the target word;
an application information receiving module, configured to receive the application-related information of the target word input based on the guidance information;
and a broadcast scene prediction module, configured to predict a broadcast scene adapted to the target word according to the application-related information and associate the broadcast scene with the target word.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above voice data processing method when executing the computer program.
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the above voice data processing method.
With the voice data processing method and apparatus, the computer device, and the storage medium described above, the target word in a voice instruction is extracted, guidance information is generated according to the target word to guide the user to input application-related information for it, and an adapted broadcast scene is predicted and associated with the target word according to the application-related information, which reflects the user's personalized needs. In this scheme, while the terminal's voice assistant interacts with the user, the terminal can dynamically extract a target word from a voice instruction, guide the user to input personalized application-related information for that word, and predict the broadcast scene adapted to it, thereby completing autonomous learning of the target word and dynamically expanding the voice assistant database in a way that meets the user's personalized needs.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a method for processing speech data;
FIG. 2 is a flow diagram illustrating a method for processing speech data in one embodiment;
FIG. 3 is a flow chart illustrating a process of a terminal voice assistant implementing personalized word learning by interacting with a user in an application example;
FIG. 4 is a schematic flowchart of broadcast feedback performed by a terminal voice assistant in an application example;
FIG. 5 is a block diagram showing the structure of a speech data processing apparatus according to an embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, the voice data processing method provided by the present application can be applied to the application environment shown in FIG. 1. Specifically, the terminal 102 extracts a target word from a voice instruction in response to the voice instruction; generates guidance information according to the target word, where the guidance information is used to guide a user to input application-related information corresponding to the target word; receives the application-related information of the target word input based on the guidance information; and predicts a broadcast scene adapted to the target word according to the application-related information and associates the broadcast scene with the target word. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, a vehicle-mounted terminal device, a smart speaker, or a portable wearable device.
In one embodiment, as shown in FIG. 2, a method for processing voice data is provided. The method is described below using its application to the terminal in FIG. 1 as an example, and includes the following steps:
step S202: in response to the voice instruction, a target word is extracted from the voice instruction.
The voice command refers to a command for triggering some controls through voice. The target word refers to a certain word or words determined as a target. The target words may not be idiomatic words already stored in the voice assistant database, for example, they may be personalized words that reflect the user's personal preferences, or words that are self-created by the user. The user refers to any person such as a driver and a tester who performs voice interaction with the terminal, or any artificial intelligence device which can simulate a person to send or trigger a voice instruction, and the like.
In this step, the user may input a voice command to the terminal, where the voice command may be one or more words, or the like. The terminal responds to a voice command input by a user, analyzes the voice command, extracts at least one word from the voice command through analysis, and determines an individualized word capable of reflecting personal preference of the user in the extracted words as a target word.
Step S204: generating guidance information according to the target word, where the guidance information is used to guide the user to input the application-related information corresponding to the target word.
Guidance information is prompt information that guides the user to input the application-related information corresponding to the target word. It can be fed back to the user through an interface display, voice playback, or the like, so as to serve as a guiding prompt. Application-related information is any information about how the target word is used, and may include, but is not limited to, the meaning, usage mode, usage scenario, applicable sentences, applicable objects, and associated skills of the target word.
For example, the terminal may present the guidance information on a display interface to guide the user to enter or select, through an external device such as a mouse, keyboard, or touch screen, answer information corresponding to the guidance information; the terminal then receives the answer information entered or selected by the user and determines it as the application-related information of the target word.
Alternatively, the terminal may play the guidance information through a playback device to guide the user to input the corresponding answer information by voice, collect the voice data input by the user, recognize and parse the answer information from the voice data, and determine the parsed answer information as the application-related information of the target word.
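A minimal sketch of how such guidance information might be assembled, assuming a simple text template (the wording mirrors the YYDS example given later in this description, and the function name is hypothetical); the same string could be shown on a display interface or handed to a text-to-speech engine:

def build_guidance(target_word: str) -> str:
    # Prompt template mirroring the example later in this description.
    return ('You can tell me the meaning and usage of "{0}". '
            "Please enter it in the following format: "
            "its meaning is ..., and its usage is ...").format(target_word)

print(build_guidance("YYDS"))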
Step S206: receiving the application-related information of the target word input based on the guidance information.
In this step, the terminal feeds the generated guidance information back to the user, and through interaction with the terminal the user inputs application-related information characterizing the target word, such as its meaning, usage mode, usage scenario, applicable sentences, and/or associated skills, which the terminal receives.
Step S208: predicting a broadcast scene adapted to the target word according to the application-related information, and associating the broadcast scene with the target word.
For example, after receiving the application-related information input by the user based on the guidance information, the terminal may use a prediction model for comprehensive analysis and calculation: the application-related information is fed into the prediction model as input parameters, the model outputs a predicted broadcast scene adapted to the target word, and the terminal associates the predicted broadcast scene with the target word.
The prediction model is a mathematical model for predicting the broadcast scene of the target word; it describes the relationship between parameter variables in mathematical language or formulas, and it can be obtained by training with a deep learning algorithm.
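The patent leaves the prediction model itself open; as a purely schematic stand-in, the keyword-scoring toy below maps the user's description and recorded emotion onto one of a few preset scenes. The scene labels and keyword lists are assumptions made for the example, not values from the patent.

# Schematic stand-in for the broadcast-scene prediction model (step S208).
SCENE_KEYWORDS = {
    "negative": {"sad", "low", "comfort", "soothe", "negative"},
    "praise": {"praise", "excellent", "forever", "god"},
    "urgent": {"danger", "help", "emergency"},
}

def predict_broadcast_scene(application_info: str, user_emotion: str) -> str:
    """Pick the preset scene whose keywords best match the description."""
    words = set(application_info.lower().split()) | {user_emotion.lower()}
    scores = {scene: len(words & kws) for scene, kws in SCENE_KEYWORDS.items()}
    return max(scores, key=scores.get)

info = "its meaning is to praise others, short for forever god"
print(predict_broadcast_scene(info, "excited"))  # -> 'praise'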
With the above voice data processing method, the target word in a voice instruction is extracted, guidance information is generated according to the target word to guide the user to input its application-related information, and an adapted broadcast scene is predicted and associated with the target word according to the application-related information, which reflects the user's personalized needs. In this scheme, during the interaction between the terminal's voice assistant and the user, the terminal can dynamically extract the target word from a voice instruction, guide the user to input personalized application-related information for it, and predict the broadcast scene adapted to it, thereby completing autonomous learning of the target word and dynamically expanding the voice assistant database in a way that meets the user's personalized needs.
Furthermore, in subsequent voice interaction, the terminal voice assistant can trigger feedback based on the broadcast scene associated with the target word, so that the terminal returns broadcast content that meets the user's personalized needs, improving the accuracy of the feedback.
In one embodiment, before generating the guidance information according to the target word, the method further includes: retrieving the target word in a voice assistant database; and if the target word cannot be retrieved in the voice assistant database, proceeding to the step of generating the guidance information according to the target word.
In this embodiment, the terminal may extract at least one word from the voice instruction as the target word and look it up in the current voice assistant database. The current voice assistant database may include a number of preconfigured instruction words with their corresponding feedback phrasing, feedback skills, and the like. If the target word is not found in the current database, it does not belong to the instruction words preset and stored there.
In that case, the scheme of this embodiment treats the word absent from the current voice assistant database as a personalized term reflecting the user's personal preference, starts the learning mode, and proceeds to the step of generating the guidance information according to the target word. In this way, the original voice assistant database can be dynamically and personally expanded while the terminal interacts with the user, gradually building a database exclusive to the user that meets his or her personalized needs.
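The retrieval gate described in this embodiment could look like the sketch below, where an in-memory set stands in for the voice assistant database and the return strings stand in for the two branches (answer normally versus enter learning mode); all names here are illustrative.

ASSISTANT_DB = {"navigate", "music", "call"}  # known instruction words

def handle_target_word(word: str) -> str:
    if word.lower() in ASSISTANT_DB:
        return f"'{word}' is already known; respond normally."
    # Not in the database: treat it as a personalized term and learn it.
    return f"'{word}' is unknown; entering learning mode."

print(handle_target_word("YYDS"))   # unknown -> learning mode
print(handle_target_word("music"))  # known -> normal handling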
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input the meaning of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input the usage mode of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input an applicable sentence of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input the usage scenario of the target word.
In one embodiment, generating the guidance information according to the target word includes: generating guidance information for guiding the user to input an associated skill of the target word.
In these embodiments of generating guidance information according to the target word, the terminal may, according to business requirements or the operator's preconfiguration, generate guidance information that guides the user to input the application-related information of the target word from different angles; the user can then configure the application-related information comprehensively according to this guidance and build instructions unique to him or her according to personalized needs.
For example, the terminal may generate guidance information that guides the user to input the meaning and usage of the target word, such as: "You can tell me the meaning and usage of 'YYDS'. Please enter it in the following format: its meaning is ..., and its usage is ...."
Further, the terminal may generate guidance information that guides the user to input the usage scenario and applicable sentences of the target word. For example, a number of different usage scenarios may be preset, such as entrustment, negative mood, anger, and happiness, and at least one applicable sentence may be preset for each usage scenario. Specifically, the terminal can call up the preset usage scenarios for the target word and feed them back to the user, guiding the user to choose one or more applicable scenarios for the target word; according to the scenario the user selects, the terminal can then feed back the applicable sentences associated with that scenario, guiding the user to choose an applicable broadcast phrase for the target word. Of course, the user may also type or speak an applicable sentence for the target word according to the guidance information. Further, the user may associate a corresponding skill with the target word, for example a skill that dials a target phone number, plays target music, or turns a target device on or off.
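One possible data layout for these presets is sketched below. The four scenario names follow the examples in this paragraph; the sentences and the skill binding are illustrative assumptions, and dial_emergency_call is a hypothetical skill identifier.

# Preset usage scenarios, each with at least one applicable sentence.
PRESET_SCENARIOS = {
    "entrustment": ["Leave it to me!"],
    "negative": ["The owner is present in my heart as YYDS."],
    "angry": ["Take a deep breath; I am right here."],
    "happy": ["Great to hear that!"],
}

# A learned word may also be bound to an associated skill.
# 'pineapple' is a made-up secret word used only for illustration.
ASSOCIATED_SKILLS = {"YYDS": None, "pineapple": "dial_emergency_call"}

print(PRESET_SCENARIOS["negative"][0])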
In one embodiment, receiving the application-related information of the target word input based on the guidance information includes: receiving input data input based on the guidance information; parsing content information and user emotion information from the input data; and using the content information and the user emotion information as the application-related information of the target word.
In this embodiment, the input data are the data the user inputs to the terminal based on the guidance information, and may include voice data and/or text data. For example, the user may speak the application-related information of the target word, or type it manually. The terminal collects the input data, parses it to obtain content information and user emotion information, and takes both as the application-related information corresponding to the target word. The content information represents what the user entered for the guidance, such as the meaning, usage mode, usage scenario, applicable sentences, applicable objects, and/or associated skills of the target word. The user emotion information represents the user's emotional state while inputting the voice and/or text data.
For example, the terminal may compare the tone and pitch of the user's voice input with preset baseline values for tone and pitch to determine whether the user's emotion was calm, positive, or negative while the data were input. Alternatively, it may check whether the text and/or voice data contain words that express an emotional state and infer the user's emotional state from those words.
For example, the terminal receives and records the meaning, usage mode, usage scenario, applicable sentences, applicable objects, and/or associated skills of the target word input by the user, and at the same time analyzes and records the user's emotional state while the word is being explained. If the user is noticeably excited and upbeat while explaining "YYDS", the emotional state at input time is recorded as excited, and the terminal can combine this emotion information with content information such as the meaning, usage mode, usage scenario, applicable sentences, applicable objects, and/or associated skills of the target word to comprehensively estimate its broadcast scene.
In this embodiment, not only the content information carried in the data the user inputs according to the guidance information, but also the information reflecting the user's emotional state while inputting it, is used as the application-related information of the target word. The terminal can therefore predict the broadcast scene of the target word from the content information in combination with the user emotion information, which improves the accuracy of broadcast-scene prediction and hence the accuracy of feedback during voice interaction between the terminal and the user.
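A toy version of this analysis is sketched below, assuming a calm pitch baseline and small emotion word lists; all thresholds and word lists are invented for illustration, and a production system would use trained acoustic and text classifiers instead.

CALM_PITCH_HZ = 180.0  # assumed baseline pitch of calm speech
POSITIVE_WORDS = {"great", "excellent", "love", "god"}
NEGATIVE_WORDS = {"sad", "tired", "awful", "hate"}

def estimate_emotion(mean_pitch_hz: float, text: str) -> str:
    """Crude emotion estimate from pitch deviation and emotion words."""
    words = set(text.lower().split())
    if words & NEGATIVE_WORDS:
        return "negative"
    if words & POSITIVE_WORDS or mean_pitch_hz > CALM_PITCH_HZ * 1.2:
        return "excited"
    return "calm"

print(estimate_emotion(230.0, "it means someone is excellent"))  # excited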
In one embodiment, the method further comprises: acquiring voice interaction data; determining a current dialog scene according to the voice interaction data; and when the current dialog scene matches the broadcast scene associated with the target word, performing a voice broadcast based on the target word.
In this embodiment, during voice interaction between the user and the terminal, the terminal may collect voice interaction data in real time and analyze the current dialog scene through voice analysis technology. When it determines that the current dialog scene matches the broadcast scene associated with the target word, it generates broadcast content based on the target word, or calls up a broadcast sentence associated with it, and actively feeds the sentence or content back to the user during the interaction.
In one embodiment, the method further comprises: acquiring voice interaction data; determining a current dialog scene according to the voice interaction data; and when the current dialog scene matches the broadcast scene associated with the target word, starting the associated skill of the target word.
In this embodiment, once the adapted dialog scene is determined, the terminal can also start the skill the user has personally associated with the target word, so that a set of instruction code words exclusive to the user can be created. For example, if the user runs into an emergency while driving, the target word can act as a secret instruction: the associated skill, such as dialing an emergency call for help, is started without anyone else noticing.
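The "instruction secret" behaviour could be wired up as in the sketch below: when the detected dialog scene matches the scene bound to a learned word that was just spoken, the associated skill fires without any explicit command. The registry, the word "pineapple", and the emergency skill are all assumptions made for the example.

def dial_emergency_call():
    print("Dialing the emergency contact...")  # placeholder action

LEARNED_WORDS = {
    # word: (associated broadcast scene, associated skill)
    "pineapple": ("urgent", dial_emergency_call),
}

def on_utterance(current_scene: str, spoken_word: str):
    entry = LEARNED_WORDS.get(spoken_word)
    if entry and entry[0] == current_scene:
        entry[1]()  # start the associated skill, unnoticed by others

on_utterance("urgent", "pineapple")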
In one embodiment, determining the current dialog scene according to the voice interaction data includes: parsing current dialog content information and current dialog emotion information from the voice interaction data; and determining the current dialog scene according to the current dialog content information and the current dialog emotion information.
In this embodiment, when the terminal, by analyzing the voice interaction data generated during a conversation with the user, detects negative phrases in the data, it may take these phrases as the basis for the current dialog emotion information and determine that the current dialog scene is a negative scene. Further, when the terminal determines that the current dialog scene matches the broadcast scene associated with the target word, it can call up a preset broadcast sentence associated with the target word, or generate personalized broadcast content in combination with the dialog content, and actively broadcast it to the user during the conversation. For example, the broadcast "The owner is present in my heart as YYDS" may be generated to soothe the user's negative mood, helping the user return to a good emotional state as soon as possible; this is particularly valuable for driving safety when the user is currently driving.
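Put together, the matching flow of this embodiment might look like the sketch below: derive the current dialog scene from content and emotion, then broadcast when it matches the scene associated with a learned word. The cue-phrase list and the broadcast sentence are illustrative.

def current_dialog_scene(content: str, emotion: str) -> str:
    negative_phrases = {"tired", "sad", "awful"}  # assumed cue words
    if emotion == "negative" or set(content.lower().split()) & negative_phrases:
        return "negative"
    return "neutral"

WORD_SCENE = {"YYDS": "negative"}
WORD_BROADCAST = {"YYDS": "The owner is present in my heart as YYDS."}

scene = current_dialog_scene("I am so tired today", "negative")
for word, broadcast_scene in WORD_SCENE.items():
    if scene == broadcast_scene:
        print(WORD_BROADCAST[word])  # soothe the user's negative mood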
In one embodiment, performing the voice broadcast based on the target word includes: playing a preset broadcast sentence associated with the target word.
In this embodiment, during the learning stage of the voice assistant, the terminal may generate guidance information for guiding the user to select an applicable sentence, and the user may associate the applicable sentence with the target word based on that guidance. In subsequent voice interaction, when the terminal determines that the current dialog scene matches the broadcast scene associated with the target word, it can directly play the sentence the user associated with the target word as the preset broadcast sentence.
In one embodiment, performing the voice broadcast based on the target word includes: generating broadcast content according to the current dialog content in combination with the application-related information associated with the target word, and playing the broadcast content.
In this embodiment, the terminal may parse the current dialog content from the voice interaction data collected in real time and combine it with the recorded application-related information of the learned target word, generating and playing a personalized broadcast script that fits the current dialog content and meets the user's needs.
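A minimal sketch of such template-based generation, assuming the learned meaning is simply spliced together with the current dialog content; the template wording is invented, and a real system could use far richer generation.

def generate_broadcast(dialog_content: str, word: str, meaning: str) -> str:
    # Fold the current dialog content and the learned meaning into one line.
    return (f"I hear that {dialog_content.strip('.').lower()}. "
            f"To me you are {word}: {meaning}!")

print(generate_broadcast("Today was rough.", "YYDS", "forever god"))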
Next, the voice data processing method of the present application is described in more detail with an application example, which may be implemented based on a voice assistant application loaded on a terminal. Refer to FIG. 3 and FIG. 4: FIG. 3 is a schematic flowchart of the terminal voice assistant learning a personalized word by interacting with the user, and FIG. 4 is a schematic flowchart of the terminal voice assistant performing broadcast feedback. The application example may specifically include the following:
1. When initiating a voice instruction, the user uses a personal word; for example, the user says, "Do you know what YYDS is?"
2. The voice assistant extracts "YYDS" as a slot value to be checked. On detecting that the word is absent from the current database, it starts the learning mode and feeds guidance information back to the user: "You can tell me the meaning and usage of 'YYDS'. Please enter it in the following format: its meaning is ..., and its usage is ...", guiding the user to input the meaning, usage mode, usage scenario, and other application-related information of the word. Several usage scenarios, with corresponding applicable sentences for each, can also be preset for the user to choose from; the scenarios may include entrustment, negative mood, anger, happiness, and so on.
3. The user describes the meaning, usage mode, usage scenario, and other application-related information of the word according to the guidance. For example, the user explains: its meaning is to praise others, it is a spelling abbreviation of "forever god", and its usage is to praise someone as particularly excellent. The user may also select at least one of the preset usage scenarios; for example, the user selects the negative scenario, i.e., wants the voice assistant to use the word in negative situations. The user may likewise be supported in typing or speaking applicable sentences for the word.
4. The voice assistant records content information such as the meaning, usage mode, and usage scenario of the word, records the user's emotion information while the word was being explained, and infers from both the broadcast sentences the word can generate and its broadcast scene. For example, the user was excited and upbeat while explaining "YYDS"; comparing the usage scenario selected by the user with this emotion information, together with the content information entered, the assistant can infer that the word should be used when the user's mood is low and negative, and so takes the low, negative scene as the word's broadcast scene.
5. When the voice assistant detects from the voice interaction data that the user's current dialog emotion is low and negative and the dialog contains negative expressions, it can generate feedback from the recorded word and its preset broadcast sentences and respond proactively, for example, "The owner is present in my heart as YYDS", to soothe the user's mood, help the user return to a good emotional state as soon as possible, and safeguard driving.
In this application example, the voice assistant collects and analyzes the meaning and usage scenario of a personal word used by the user (that is, a word not in the voice assistant database, or a word the user coined) and records the user's emotional state while the word is described. When the user's dialog scene or dialog content fits the broadcast scene of the personalized word, the voice assistant can use the word, combined with the current dialog emotion and content, to generate a personalized broadcast script, thereby adjusting the user's emotional state. With this scheme, a voice assistant database that matches the user's personal language habits is built up as the assistant is used; with a vehicle-mounted voice assistant in particular, the user's driving mood is better protected and driving safety improves.
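Condensed into code, the record the assistant might keep after this walkthrough could look as follows; the field names are assumptions, while the values are taken from the YYDS example above.

from dataclasses import dataclass, field

@dataclass
class LearnedWord:
    word: str
    meaning: str
    usage: str
    usage_scene: str        # scenario chosen by the user
    user_emotion: str       # emotion recorded during the explanation
    broadcast_scene: str    # scene inferred for proactive broadcasts
    broadcast_sentences: list = field(default_factory=list)

yyds = LearnedWord(
    word="YYDS",
    meaning='spelling abbreviation of "forever god", used to praise others',
    usage="praise someone as particularly excellent",
    usage_scene="negative",
    user_emotion="excited",
    broadcast_scene="low and negative",
    broadcast_sentences=["The owner is present in my heart as YYDS."],
)
print(yyds.word, "->", yyds.broadcast_scene)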
It should be understood that although the steps in the flowcharts of FIGS. 2-4 are displayed in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages that are not necessarily executed at the same moment but may be executed at different times, and their order of execution is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 5, a voice data processing apparatus is provided, including: a voice instruction acquisition module 510, a guidance information generation module 520, an application information receiving module 530, and a broadcast scene prediction module 540, wherein:
the voice instruction acquisition module 510 is configured to extract a target word from a voice instruction in response to the voice instruction;
the guidance information generation module 520 is configured to generate guidance information according to the target word, where the guidance information is used to guide a user to input application-related information corresponding to the target word;
the application information receiving module 530 is configured to receive the application-related information of the target word input based on the guidance information;
and the broadcast scene prediction module 540 is configured to predict a broadcast scene adapted to the target word according to the application-related information and associate the broadcast scene with the target word.
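One way the four modules could be laid out as classes is sketched below; the class names mirror the module names above, while the method signatures are assumptions and the bodies are deliberately left as stubs.

class VoiceInstructionAcquisitionModule:
    def extract_target_word(self, instruction: str) -> str: ...

class GuidanceInformationGenerationModule:
    def build_guidance(self, target_word: str) -> str: ...

class ApplicationInformationReceivingModule:
    def receive_application_info(self, guidance: str) -> dict: ...

class BroadcastScenePredictionModule:
    def predict_and_associate(self, target_word: str, info: dict) -> str: ...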
In one embodiment, the voice instruction acquisition module 510 is further configured to retrieve the target word in a voice assistant database and, if the target word cannot be retrieved in the voice assistant database, proceed to the step of generating the guidance information according to the target word.
In one embodiment, the application information receiving module 530 is configured to receive input data input based on the guidance information, parse content information and user emotion information from the input data, and use the content information and the user emotion information as the application-related information of the target word.
In one embodiment, the guidance information generation module 520 is configured to generate guidance information for guiding the user to input the meaning of the target word.
In one embodiment, the guidance information generation module 520 is configured to generate guidance information for guiding the user to input the usage mode of the target word.
In one embodiment, the guidance information generation module 520 is configured to generate guidance information for guiding the user to input the usage scenario of the target word.
In one embodiment, the guidance information generation module 520 is configured to generate guidance information for guiding the user to input an applicable sentence of the target word.
In one embodiment, the guidance information generation module 520 is configured to generate guidance information for guiding the user to input an associated skill of the target word.
In one embodiment, the apparatus further includes a voice broadcast module 550, configured to acquire voice interaction data, determine a current dialog scene according to the voice interaction data, and, when the current dialog scene matches the broadcast scene associated with the target word, perform a voice broadcast based on the target word.
In one embodiment, the voice broadcast module 550 is further configured to acquire voice interaction data, determine a current dialog scene according to the voice interaction data, and, when the current dialog scene matches the broadcast scene associated with the target word, start the associated skill of the target word.
In one embodiment, the voice broadcast module 550 is configured to parse current dialog content information and current dialog emotion information from the voice interaction data and determine the current dialog scene according to the current dialog content information and the current dialog emotion information.
In one embodiment, the voice broadcast module 550 is configured to play a preset broadcast sentence associated with the target word.
In one embodiment, the voice broadcast module 550 is configured to generate broadcast content according to the current dialog content in combination with the application-related information associated with the target word, and to play the broadcast content.
For specific limitations of the voice data processing apparatus, reference may be made to the limitations of the voice data processing method above, which are not repeated here. Each module in the voice data processing apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a voice data processing method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in FIG. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; the processor, when executing the computer program, implements the following steps: in response to a voice instruction, extracting a target word from the voice instruction; generating guidance information according to the target word, where the guidance information is used to guide a user to input application-related information corresponding to the target word; receiving the application-related information of the target word input based on the guidance information; and predicting a broadcast scene adapted to the target word according to the application-related information, and associating the broadcast scene with the target word.
In one embodiment, before the processor executes the computer program to generate the guidance information according to the target word, the following steps are specifically implemented: retrieving the target word in a voice assistant database; and if the target word cannot be retrieved in the voice assistant database, proceeding to the step of generating the guidance information according to the target word.
In one embodiment, when the processor executes the computer program to receive the application-related information of the target word input based on the guidance information, the following steps are specifically implemented: receiving input data input based on the guidance information; parsing content information and user emotion information from the input data; and using the content information and the user emotion information as the application-related information of the target word.
In one embodiment, when the processor executes the computer program to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input the meaning of the target word.
In one embodiment, when the processor executes the computer program to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input the usage mode of the target word.
In one embodiment, when the processor executes the computer program to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input the usage scenario of the target word.
In one embodiment, when the processor executes the computer program to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input an applicable sentence of the target word.
In one embodiment, when the processor executes the computer program to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input an associated skill of the target word.
In one embodiment, the processor, when executing the computer program, further implements the following steps: acquiring voice interaction data; determining a current dialog scene according to the voice interaction data; and when the current dialog scene matches the broadcast scene associated with the target word, performing a voice broadcast based on the target word.
In one embodiment, the processor, when executing the computer program, further implements the following steps: acquiring voice interaction data; determining a current dialog scene according to the voice interaction data; and when the current dialog scene matches the broadcast scene associated with the target word, starting the associated skill of the target word.
In one embodiment, when the processor executes the computer program to determine the current dialog scene according to the voice interaction data, the following steps are specifically implemented: parsing current dialog content information and current dialog emotion information from the voice interaction data; and determining the current dialog scene according to the current dialog content information and the current dialog emotion information.
In one embodiment, when the processor executes the computer program to perform the voice broadcast based on the target word, the following steps are specifically implemented: playing a preset broadcast sentence associated with the target word.
In one embodiment, when the processor executes the computer program to perform the voice broadcast based on the target word, the following steps are specifically implemented: generating broadcast content according to the current dialog content in combination with the application-related information associated with the target word, and playing the broadcast content.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; the computer program, when executed by a processor, implements the following steps: in response to a voice instruction, extracting a target word from the voice instruction; generating guidance information according to the target word, where the guidance information is used to guide a user to input application-related information corresponding to the target word; receiving the application-related information of the target word input based on the guidance information; and predicting a broadcast scene adapted to the target word according to the application-related information, and associating the broadcast scene with the target word.
In one embodiment, before the computer program is executed by the processor to generate the guidance information according to the target word, the following steps are specifically implemented: retrieving the target word in a voice assistant database; and if the target word cannot be retrieved in the voice assistant database, proceeding to the step of generating the guidance information according to the target word.
In one embodiment, when the computer program is executed by the processor to receive the application-related information of the target word input based on the guidance information, the following steps are specifically implemented: receiving voice data input based on the guidance information; parsing content information and user emotion information from the voice data; and using the content information and the user emotion information as the application-related information of the target word.
In one embodiment, when the computer program is executed by the processor to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input the meaning of the target word.
In one embodiment, when the computer program is executed by the processor to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input the usage mode of the target word.
In one embodiment, when the computer program is executed by the processor to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input the usage scenario of the target word.
In one embodiment, when the computer program is executed by the processor to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input an applicable sentence of the target word.
In one embodiment, when the computer program is executed by the processor to generate the guidance information according to the target word, the following steps are specifically implemented: generating guidance information for guiding the user to input an associated skill of the target word.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring voice interaction data; determining a current dialog scene according to the voice interaction data; and when the current dialog scene matches the broadcast scene associated with the target word, performing a voice broadcast based on the target word.
In one embodiment, the computer program, when executed by the processor, further implements the following steps: acquiring voice interaction data; determining a current dialog scene according to the voice interaction data; and when the current dialog scene matches the broadcast scene associated with the target word, starting the associated skill of the target word.
In one embodiment, when the computer program is executed by the processor to determine the current dialog scene according to the voice interaction data, the following steps are specifically implemented: parsing current dialog content information and current dialog emotion information from the voice interaction data; and determining the current dialog scene according to the current dialog content information and the current dialog emotion information.
In one embodiment, when the computer program is executed by the processor to perform the voice broadcast based on the target word, the following steps are specifically implemented: playing a preset broadcast sentence associated with the target word.
In one embodiment, when the computer program is executed by the processor to perform the voice broadcast based on the target word, the following steps are specifically implemented: generating broadcast content according to the current dialog content in combination with the application-related information associated with the target word, and playing the broadcast content.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as the combined features are not contradictory.
The above embodiments express only several implementations of the present application, and while their description is specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A voice data processing method, the method comprising:
in response to a voice instruction, extracting a target word from the voice instruction;
generating guide information according to the target word, wherein the guide information is used for guiding a user to input application-related information corresponding to the target word;
receiving the application-related information of the target word input based on the guide information; and
predicting a broadcasting scene adapted to the target word according to the application-related information, and associating the broadcasting scene with the target word.
2. The method of claim 1, wherein, prior to generating the guide information according to the target word, the method further comprises:
searching for the target word in a voice assistant database; and
if the target word is not found in the voice assistant database, entering the step of generating the guide information according to the target word.
3. The method of claim 1, wherein receiving the application-related information of the target word input based on the guide information comprises:
receiving input data input based on the guide information;
parsing content information and user emotion information from the input data; and
using the content information and the user emotion information as the application-related information of the target word.
4. The method of claim 1, wherein generating the guide information according to the target word comprises:
generating guide information for guiding a user to input the meaning of the target word; and/or
generating guide information for guiding a user to input a usage mode of the target word; and/or
generating guide information for guiding a user to input a usage scenario of the target word; and/or
generating guide information for guiding a user to input an applicable sentence of the target word; and/or
generating guide information for guiding a user to input the associated skill of the target word.
5. The method according to any one of claims 1 to 4, further comprising:
acquiring voice interaction data;
determining a current conversation scene according to the voice interaction data;
and when the current conversation scene matches the broadcasting scene associated with the target word, performing voice broadcasting based on the target word and/or starting the associated skill of the target word.
6. The method of claim 5, wherein determining the current conversation scene according to the voice interaction data comprises:
parsing current conversation content information and current conversation emotion information from the voice interaction data; and
determining the current conversation scene according to the current conversation content information and the current conversation emotion information.
7. The method of claim 6, wherein performing voice broadcasting based on the target word comprises:
playing a preset broadcast phrase associated with the target word; or
generating broadcast content according to the current conversation content combined with the application-related information associated with the target word, and playing the broadcast content.
8. A voice data processing apparatus, characterized in that the apparatus comprises:
a voice instruction acquisition module, used for extracting a target word from a voice instruction in response to the voice instruction;
a guide information generating module, used for generating guide information according to the target word, wherein the guide information is used for guiding a user to input application-related information corresponding to the target word;
an application information receiving module, used for receiving the application-related information of the target word input based on the guide information; and
a broadcasting scene prediction module, used for predicting a broadcasting scene adapted to the target word according to the application-related information, and associating the broadcasting scene with the target word.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
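Purely as an illustrative reading of claim 1, the sketch below wires the four claimed steps together, reusing the helpers from the earlier sketches; extract_target_word, collect_reply, and predict_broadcast_scene are hypothetical stubs, not implementations disclosed in this application.

def extract_target_word(text: str) -> str:
    # Hypothetical NLU helper: pick out the word the user wants taught.
    return text.split()[-1]

def collect_reply(prompt: str) -> bytes:
    # Hypothetical I/O helper: play the prompt and record the reply.
    print(prompt)
    return b""

def predict_broadcast_scene(app_info: ApplicationInfo) -> str:
    # Hypothetical predictor: map the stored emotion to a scene label.
    return "celebration" if app_info.emotion == "happy" else "general"

def process_voice_instruction(voice_data: bytes, scene_registry: dict) -> None:
    # Step 1: extract a target word from the voice instruction.
    target_word = extract_target_word(transcribe(voice_data))
    # Step 2: generate guide information according to the target word.
    prompt = generate_guide_information(target_word, "usage_scene")
    # Step 3: receive application-related information input based on
    # the guide information.
    app_info = receive_application_info(collect_reply(prompt))
    # Step 4: predict an adapted broadcasting scene and associate it
    # with the target word.
    scene_registry[predict_broadcast_scene(app_info)] = target_word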
CN202111626554.9A 2021-12-28 2021-12-28 Voice data processing method and device, computer equipment and storage medium Pending CN114420117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111626554.9A CN114420117A (en) 2021-12-28 2021-12-28 Voice data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111626554.9A CN114420117A (en) 2021-12-28 2021-12-28 Voice data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114420117A (en) 2022-04-29

Family

ID=81270231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111626554.9A Pending CN114420117A (en) 2021-12-28 2021-12-28 Voice data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114420117A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116016578A (en) * 2022-11-22 2023-04-25 中国第一汽车股份有限公司 Intelligent voice guiding method based on equipment state and user behavior
CN116016578B (en) * 2022-11-22 2024-04-16 中国第一汽车股份有限公司 Intelligent voice guiding method based on equipment state and user behavior

Similar Documents

Publication Publication Date Title
CN110998720B (en) Voice data processing method and electronic device supporting the same
EP2492910B1 (en) Speech translation system, control device and control method
KR20200014510A (en) Method for providing prediction service based on mahcine-learning and apparatus thereof
JPWO2015075975A1 (en) Dialog control apparatus and dialog control method
CN111209380B (en) Control method and device for conversation robot, computer equipment and storage medium
CN109074804B (en) Accent-based speech recognition processing method, electronic device, and storage medium
KR20200023088A (en) Electronic apparatus for processing user utterance and controlling method thereof
KR20210020656A (en) Apparatus for voice recognition using artificial intelligence and apparatus for the same
WO2016040402A1 (en) Actions on digital document elements from voice
CN110808038A (en) Mandarin assessment method, device, equipment and storage medium
CN112214607B (en) Interactive method, psychological intervention system, terminal and medium based on artificial intelligence
CN111897601B (en) Application starting method, device, terminal equipment and storage medium
CN110931002B (en) Man-machine interaction method, device, computer equipment and storage medium
CN114420117A (en) Voice data processing method and device, computer equipment and storage medium
CN109388792B (en) Text processing method, device, equipment, computer equipment and storage medium
KR20080114100A (en) Method and apparatus of naturally talking with computer
KR20190021136A (en) System and device for generating TTS model
CN117076635A (en) Information processing method, apparatus, device and storage medium
KR20160138613A (en) Method for auto interpreting using emoticon and apparatus using the same
US11749270B2 (en) Output apparatus, output method and non-transitory computer-readable recording medium
KR20190083884A (en) Method for displaying an electronic document for processing a voice command and electronic device thereof
CN113987142A (en) Voice intelligent interaction method, device, equipment and storage medium with virtual doll
Jeong et al. A computer remote control system based on speech recognition technologies of mobile devices and wireless communication technologies
KR20200058612A (en) Artificial intelligence speaker and talk progress method using the artificial intelligence speaker
KR20220045741A (en) Apparatus, method and computer program for providing voice recognition service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination