WO2020119541A1 - Voice data recognition method, device, and system - Google Patents

Voice data recognition method, device, and system

Info

Publication number
WO2020119541A1
Authority
WO
WIPO (PCT)
Prior art keywords
client
data
voice data
recognition result
user
Prior art date
Application number
PCT/CN2019/122933
Other languages
English (en)
French (fr)
Inventor
祝俊
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2020119541A1 publication Critical patent/WO2020119541A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L 2015/228 Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of application context

Definitions

  • the present invention relates to the field of voice processing technology, and in particular, to a voice data recognition method, device, and system.
  • the intelligent voice device can recognize the voice data input by the user through voice recognition technology, thereby providing personalized services for the user.
  • In practical applications, there exist polyphonic characters, homophones, and near-homophones, such as "天下", "甜虾", and "田霞", which sound nearly identical when spoken.
  • Traditional speech recognition schemes cannot distinguish such words well, which inevitably affects the user's interactive experience.
  • the present invention provides a voice data recognition method, device and system, in an effort to solve or at least alleviate at least one of the above problems.
  • a method for recognizing voice data, including the steps of: obtaining voice data and scene information of a client; recognizing the voice data to generate a first recognition result of the voice data; and recognizing the first recognition result according to the scene information to generate a second recognition result of the voice data.
  • the step of recognizing the first recognition result according to the scene information to generate the second recognition result of the voice data includes: determining the current business scenario of the client based on the first recognition result and the scene information; and recognizing the first recognition result according to the business scenario to generate the second recognition result of the voice data.
  • the step of recognizing the first recognition result according to the business scenario to generate the second recognition result of the voice data further includes: extracting the entity to be determined from the first recognition result; obtaining at least one candidate entity from the client according to the business scenario; matching the entity to be determined to one entity from the at least one candidate entity; and generating the second recognition result according to the matched entity.
  • the method according to the present invention further includes the step of instructing the client to enter a working state if the voice data contains a predetermined object.
  • the method according to the present invention further includes the steps of: obtaining a representation of the user's intention based on the generated second recognition result, and generating an instruction response; and outputting the instruction response.
  • the scene information includes one or more of the following: the client's process data, the client's application list, application usage history data on the client, the user's personal data associated with the client, data obtained from the conversation history, data obtained from at least one sensor of the client, text data in the client's display page, and input data provided by the user in advance.
  • the step of matching the entity to be determined to an entity from the at least one candidate entity includes: calculating the similarity value between each of the at least one candidate entity and the entity to be determined; and selecting the candidate entity with the largest similarity value as the matched entity.
  • a voice data recognition method including the steps of: acquiring voice data and scene information of a client; and recognizing voice data according to the scene information to generate a recognition result of the voice data.
  • a voice data recognition apparatus, including: a connection management unit adapted to obtain voice data from a client and scene information of the client; a first processing unit adapted to recognize the voice data and generate a first recognition result of the voice data; and a second processing unit adapted to recognize the first recognition result according to the scene information and generate a second recognition result of the voice data.
  • the second processing unit includes: a business scenario determination module, adapted to determine the current business scenario of the client based on the first recognition result and the scene information; and an enhanced processing module, adapted to recognize the first recognition result according to the business scenario to generate the second recognition result of the voice data.
  • the enhanced processing module includes: an entity acquisition module adapted to extract the entity to be determined from the first recognition result, and is further adapted to acquire at least one candidate entity from the client according to the business scenario;
  • the matching module is adapted to match an entity to be determined from at least one candidate entity to an entity;
  • the generation module is adapted to generate a second recognition result according to the matched entity.
  • the scene information includes one or more of the following: the client's process data, the client's application list, application usage history data on the client, the user's personal data associated with the client, data obtained from the conversation history, data obtained from at least one sensor of the client, text data in the client's display page, and input data provided by the user in advance.
  • a voice data recognition system, including: a client adapted to receive the user's voice data and transmit it to a voice data recognition device; and a server, including the voice data recognition device described above, adapted to recognize the voice data from the client to generate a corresponding second recognition result.
  • the voice data recognition device is further adapted to obtain a representation of the user's intention based on the generated second recognition result, and to generate an instruction response, and is also adapted to output the instruction response to the client; and
  • the client is further adapted to perform corresponding operations according to the instruction response.
  • the voice data recognition device is further adapted to instruct the client to enter a working state when the voice data from the client contains a predetermined object.
  • a smart speaker, including: an interface unit adapted to obtain voice data input by a user; and an interaction unit adapted to obtain the current scene information in response to the user's voice input, further adapted to obtain the command response generated after the voice data is recognized according to the scene information, and to perform the corresponding operation based on the command response.
  • a computing device including: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by at least one processor, and the program instructions include Instructions for performing any of the methods described above.
  • a readable storage medium storing program instructions, which causes the computing device to perform any of the methods described above when the program instructions are read and executed by the computing device.
  • the client uploads the voice data input by the user to the server for recognition, and simultaneously uploads the scene information on the client as additional data to the server.
  • the scene information represents the current state of the client.
  • After initially recognizing the voice data, the server also optimizes the initially recognized text based on the scene information, and finally obtains the recognized text. In this way, the recognition of the voice data is closely combined with the current state of the client, which can significantly improve the accuracy of recognition.
  • FIG. 1 shows a schematic diagram of a scene of a voice data recognition system 100 according to an embodiment of the present invention
  • FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the invention
  • FIG. 3 shows an interaction flowchart of a voice data recognition method 300 according to an embodiment of the present invention
  • FIG. 4 shows a schematic diagram of a display interface of a client according to an embodiment of the present invention
  • FIG. 5 shows a schematic flowchart of a voice data recognition method 500 according to another embodiment of the present invention.
  • FIG. 6 shows a schematic diagram of a voice data recognition device 600 according to an embodiment of the present invention.
  • FIG. 1 shows a schematic diagram of a scene of a voice data recognition system 100 according to an embodiment of the present invention.
  • the system 100 includes a client 110 and a server 120.
  • the system 100 shown in FIG. 1 is only an example; those skilled in the art can understand that in practical applications the system 100 generally includes multiple clients 110 and servers 120, and the present invention does not limit the number of clients 110 and servers 120 included in the system 100.
  • the client 110 is a device with a voice interaction module, which can receive voice instructions from the user and return voice or non-voice information to the user.
  • a typical voice interaction module includes a voice input unit such as a microphone, a voice output unit such as a speaker, and a processor.
  • the voice interaction module can be built into the client 110, or can be used as an independent module cooperating with the client 110 (for example, communicating with the client 110 via an API or other means and calling the services of functions or application interfaces on the client 110); the embodiments of the present invention do not limit this.
  • the client 110 may be, for example, a mobile device (such as a smart speaker) with a voice interaction module, a smart robot, a smart home appliance (including a smart TV, a smart refrigerator, a smart microwave oven, etc.), etc., but is not limited thereto.
  • An application scenario of the client 110 is a home scenario, that is, the client 110 is placed in the user's home, and the user can issue voice instructions to the client 110 to implement certain functions, such as surfing the Internet, requesting songs, shopping, checking the weather forecast, controlling other smart home devices in the home, and so on.
  • the server 120 includes a voice data recognition device 600, which is used to provide a recognition service for the voice data received on the client 110 to obtain a textual representation of the voice data input by the user, and, after obtaining a representation of the user's intention based on the textual representation, to generate an instruction response and return it to the client 110.
  • the client 110 receives the voice data input by the user, and transmits it to the server 120 together with the scene information on the client. It should be noted that the client 110 may also report to the server 120 when receiving the voice data input by the user, and the server 120 may pull the corresponding voice data and scene information from the client 110. The embodiments of the present invention do not limit this too much.
  • the server 120 cooperates with the client 110 to recognize the voice data according to the scene information to generate a corresponding recognition result.
  • the server 120 can also understand the user's intention through the recognition result and generate a corresponding command response for the client 110; the client 110 performs corresponding operations according to the command response to provide the user with corresponding services, such as setting an alarm clock, making a call, sending mail, broadcasting information, playing songs or videos, etc.
  • the client 110 may also output a corresponding voice response to the user according to the command response, which is not limited in the embodiment of the present invention.
  • the scene information of the client is, for example, the state where the user is operating an application or similar software on the client.
  • the user may be using an application to play video streaming data, and for example, the user is using a social software to communicate with a specific individual.
  • the client 110 receives the voice data input by the user, the client 110 transmits the above scene information to the server 120, so that the server 120 analyzes the voice data input by the user based on the scene information to accurately perceive the user's intention.
  • Taking the case where the client 110 is implemented as a smart speaker as an example, a voice data recognition scheme according to an embodiment of the present invention is outlined below.
  • the smart speaker further includes: an interface unit and an interaction unit.
  • the interface unit obtains the voice data input by the user; the interaction unit obtains the smart speaker's current scene information in response to the user's voice input; the interaction unit can then also obtain the command response generated after the voice data is recognized according to the scene information, and perform the corresponding operation based on the command response.
  • the interface unit may transmit the acquired voice data together with the current scene information to the server 120, so that the server 120 recognizes the voice data according to the scene information and generates a recognition result of the voice data.
  • the server 120 will also generate an instruction response based on the recognition result and return it to the smart speaker (for the above-mentioned execution process of the server 120, please refer to the related description content in FIG. 3 below, which will not be expanded here).
  • Based on the command response, the smart speaker performs the corresponding operation and outputs the result to the user.
  • For a more specific execution flow, reference may be made to the related descriptions of FIG. 1 and FIG. 3; the embodiments of the present invention do not impose many restrictions on this.
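  • As an illustration of the client-side interaction outlined above, the following is a minimal sketch of how a smart-speaker client might bundle captured voice data with its current scene information and act on the returned command response. It is only an illustrative sketch, not part of the patent disclosure: the endpoint URL, the JSON field names, and the helper functions are all assumptions.

```python
import json
import urllib.request

SERVER_URL = "http://server.example/recognize"  # placeholder endpoint, not from the patent

def collect_scene_info():
    """Gather whatever client-side context is available (illustrative fields only)."""
    return {
        "foreground_process": "music_player",
        "display_text": ["精选", "热播大剧", "热映电影"],
    }

def post_to_server(voice_bytes, scene_info):
    """Send the voice data and the scene information to the server in one request."""
    payload = {"voice_data": voice_bytes.hex(), "scene_info": scene_info}
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())  # expected to contain the command response

def execute_command(command):
    """Device-specific handling: play audio, click a UI element, speak a reply, etc."""
    print("executing:", command)

def handle_user_utterance(voice_bytes):
    """One interaction turn: upload voice plus scene info, then act on the returned command."""
    response = post_to_server(voice_bytes, collect_scene_info())
    command = response.get("command")  # e.g. {"action": "play_song", "target": "遇见"}
    if command:
        execute_command(command)
```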
  • the server 120 may also be implemented as other electronic devices (eg, other computing devices in the same IoT environment) connected to the client 110 through the network. Even when the client 110 (for example, a smart speaker) has sufficient storage space and computing power, the server 120 can also be implemented as the client 110 itself.
  • FIG. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the invention.
  • the computing device 200 typically includes system memory 206 and one or more processors 204.
  • the memory bus 208 may be used for communication between the processor 204 and the system memory 206.
  • the processor 204 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
  • the processor 204 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and a register 216.
  • the example processor core 214 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
  • the example memory controller 218 may be used with the processor 204, or in some implementations, the memory controller 218 may be an internal part of the processor 204.
  • the system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof.
  • the system memory 206 may include an operating system 220, one or more applications 222, and program data 224.
  • the application 222 may be arranged to be executed on the operating system by the one or more processors 204 using the program data 224.
  • the computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (eg, output device 242, peripheral interface 244, and communication device 246) to the basic configuration 202 via the bus/interface controller 230.
  • the example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices such as displays or speakers via one or more A/V ports 252.
  • the example peripheral interface 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to facilitate communication via one or more I/O ports 258 with external devices such as input devices (e.g., keyboard, mouse, pen, voice input devices, touch input devices) or other peripherals (such as printers, scanners, etc.).
  • the example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 via a network communication link via one or more communication ports 264.
  • the network communication link may be an example of a communication medium.
  • Communication media can generally be embodied as computer readable instructions, data structures, program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and can include any information delivery media.
  • a "modulated data signal" may be a signal in which one or more of its data set or its changes can be made in such a way as to encode information in the signal.
  • the communication medium may include a wired medium such as a wired network or a dedicated line network, and various wireless media such as sound, radio frequency (RF), microwave, infrared (IR), or other wireless media.
  • the term computer readable media as used herein may include both storage media and communication media.
  • the computing device 200 may be implemented as a server, such as a file server, a database server, an application server, a WEB server, etc., or as a personal computer including a desktop computer and a notebook computer configuration. Of course, the computing device 200 may also be implemented as part of a small-sized portable (or mobile) electronic device. In the embodiment according to the present invention, the computing device 200 is configured to perform the voice data recognition method according to the present invention.
  • the application 222 of the computing device 200 includes a plurality of program instructions for executing the method 300 according to the present invention.
  • FIG. 3 shows an interaction flowchart of a voice data recognition method 300 according to an embodiment of the present invention.
  • the identification method 300 is suitable for execution in the system 100 described above. As shown in FIG. 3, the method 300 starts at step S310.
  • In step S310, the client 110 receives the various voice data input by the user and detects whether it contains a predetermined object (the predetermined object is, for example, a predetermined wake-up word); if the predetermined object is included, the data is transmitted to the server 120.
  • the microphone of the voice interaction module continuously receives external sounds.
  • the user wants to use the client 110 for voice interaction, the user needs to say the corresponding wake-up word to wake up the client 110 first.
  • It should be understood that in some scenarios the client 110 is always in a working state, and what the user wakes up by entering the wake-up word is the voice interaction module in the client 110; for ease of description, both cases are uniformly referred to herein as "waking up the client 110".
  • the wake-up word can be set in advance when the client 110 is shipped from the factory, or can be set by the user during the use of the client 110.
  • the present invention does not limit the length and content of the wake-up word.
  • For example, the wake-up word can be set to "小精灵" ("elf"), "你好，小精灵" ("hello, elf"), and so on.
  • the client 110 may directly transmit the predetermined object to the server 120, or may transmit the voice data containing the predetermined object to the server 120, to inform the server 120 that the client 110 is to be woken up. Subsequently, in step S320, after receiving the notification from the client 110, the server 120 confirms that the user wants to use the client 110 for voice interaction, performs the corresponding wake-up processing, and instructs the client 110 to enter the working state.
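  • As a hedged illustration of the wake-up check in step S310, the sketch below shows a client that listens continuously and only notifies the server once a predetermined wake-up word is detected. The `spot_keywords` helper stands in for a real keyword-spotting model and is stubbed out here; only the example wake-up words come from the description.

```python
WAKE_WORDS = {"小精灵", "你好，小精灵"}  # example wake-up words from the description

def spot_keywords(audio_frame) -> str:
    """Placeholder: a real client would run a lightweight keyword-spotting model here."""
    return ""

def contains_wake_word(audio_frame) -> bool:
    detected = spot_keywords(audio_frame)
    return any(word in detected for word in WAKE_WORDS)

def on_audio_frame(audio_frame, notify_server) -> None:
    # Only when the predetermined object is present is the server notified to wake the client.
    if contains_wake_word(audio_frame):
        notify_server(audio_frame)
```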
  • the instruction returned by the server 120 to the client 110 contains text data.
  • the text data returned by the server 120 is "Hello, please speak.”
  • After receiving the instruction, the client 110 converts the text data into voice data through TextToSpeech (TTS) technology and plays it through the voice interaction module, to inform the user that the client 110 has been woken up and voice interaction can begin.
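  • A minimal client-side sketch of this TTS step is given below. The pyttsx3 library is used purely as an example engine; the patent does not prescribe any particular TTS implementation.

```python
import pyttsx3

def speak(text: str) -> None:
    engine = pyttsx3.init()  # use the platform's default speech engine
    engine.say(text)
    engine.runAndWait()      # block until playback finishes

speak("你好，请讲")  # play the server's wake-up acknowledgement ("Hello, please speak") to the user
```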
  • In the subsequent step S330, with the client 110 in the awakened state, the client 110 receives the voice data input by the user and forwards it to the server 120.
  • According to an embodiment of the present invention, in order to optimize the recognition process of the voice data, when receiving the voice data input by the user, the client 110 also collects the scene information of the client 110 and forwards it to the server 120.
  • The client's scene information may include any information obtainable on the client. In some embodiments, the client's scene information includes one or more of the following: the client's process data, the client's application list, application usage history data on the client, the user's personal data associated with the client, data obtained from the conversation history, data obtained from at least one sensor of the client (such as a light sensor, distance sensor, gravity sensor, acceleration sensor, GPS position sensor, temperature and humidity sensor, etc.), text data in the client's display page, and input data provided by the user in advance, but is not limited to these.
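  • The following is an illustrative container for such scene information. The field names are assumptions chosen to mirror the enumerated items; the patent does not fix any particular schema or encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SceneInfo:
    process_data: List[str] = field(default_factory=list)       # running processes, e.g. a dial pad
    app_list: List[str] = field(default_factory=list)           # applications installed on the client
    app_usage_history: List[str] = field(default_factory=list)  # recently used applications
    user_profile: Dict[str, str] = field(default_factory=dict)  # personal data tied to the client
    dialog_history: List[str] = field(default_factory=list)     # previous turns of the conversation
    sensor_data: Dict[str, float] = field(default_factory=dict) # e.g. GPS, light level, temperature
    display_text: List[str] = field(default_factory=list)       # text visible on the current page
    user_inputs: List[str] = field(default_factory=list)        # data provided by the user in advance
```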
  • the server 120 recognizes the voice data to obtain the recognition result of the voice data (in a preferred embodiment, the recognition result is expressed in text, but not limited to this), and the user's intention is analyzed based on this. According to an embodiment of the present invention, the server 120 completes the optimized recognition process in two steps, which are described below as step S340 and step S350.
  • In step S340, the server 120 performs preliminary recognition on the voice data and generates the first recognition result of the voice data.
  • the server 120 recognizes voice data through ASR (Automatic Speech Recognition) technology.
  • the server 120 may first express the voice data as text data, and then perform word segmentation processing on the text data to match to obtain the first recognition result.
  • a typical speech recognition method may be, for example, a method based on a vocal tract model and speech knowledge, a template matching method, and a method using a neural network, etc.
  • The embodiments of the present invention do not impose many restrictions on which ASR technology is used for speech recognition; any known or future speech recognition algorithm can be combined with the embodiments of the present invention to implement the method 300 of the present invention.
  • It should be noted that, when performing recognition through ASR technology, the server 120 may also perform some preprocessing operations on the voice data, such as sampling, quantization, removal of voice data that does not contain speech content (e.g., silent segments), and framing and windowing of the voice data; these are not expanded upon here.
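  • As a hedged illustration of the framing and windowing preprocessing mentioned above, the sketch below splits a waveform into overlapping Hamming-windowed frames using NumPy. The frame length and hop size are typical illustrative values, not values specified by the patent.

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames and apply a Hamming window to each."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)

# Example: one second of 16 kHz audio -> 25 ms frames with a 10 ms hop.
frames = frame_and_window(np.random.randn(16000))
```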
  • In step S350, the server 120 then recognizes the first recognition result according to the scene information and generates the second recognition result of the voice data.
  • According to an embodiment of the present invention, step S350 may be performed in two steps.
  • In the first step, based on the first recognition result generated in step S340 and the scene information of the client 110, the current business scenario of the client 110 is determined.
  • The business scenario of the client 110 characterizes the business scenario that the client 110 is currently in, or, as analyzed from the user input, is about to be in.
  • Business scenes may include, for example, call scenes, short message scenes, song listening scenes, video playing scenes, web browsing scenes, and so on.
  • Suppose the voice data input by the user through the client 110 is "打电话给之魂" and the first recognition result obtained after preliminary speech recognition is "打电话给志文". The server 120 analyzes from the scene information of the client 110 that the client 110 may be in a call business scenario (for example, a "dial pad" is among the processes already opened on the client 110), i.e., it determines that the client's current business scenario is a call scenario.
  • Alternatively, the server 120 performs word segmentation on the first recognition result to obtain the keyword "打电话" ("make a call") that characterizes the user's action, combines this keyword with the scene information of the client 110, and concludes that the current business scenario is a call scenario.
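  • A minimal sketch of this first sub-step of S350 is shown below: an action keyword extracted from the first recognition result is combined with the client's scene information to decide the business scenario. The keyword table, the scenario names, and the fallback rule are illustrative assumptions only.

```python
ACTION_TO_SCENARIO = {
    "打电话": "call",    # "make a call"
    "发短信": "sms",     # "send a text message"
    "我想听": "music",   # "I want to listen to ..."
    "播放": "video",     # "play ..."
}

def determine_business_scenario(first_result: str, scene_info: dict) -> str:
    # 1) keyword evidence from the preliminary (first) recognition result
    for keyword, scenario in ACTION_TO_SCENARIO.items():
        if keyword in first_result:
            return scenario
    # 2) otherwise fall back to evidence from the client state, e.g. an open dial pad process
    if "拨号键盘" in scene_info.get("process_data", []):
        return "call"
    return "unknown"

print(determine_business_scenario("打电话给志文", {"process_data": ["拨号键盘"]}))  # -> "call"
```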
  • In the second step, the server 120 further recognizes the first recognition result according to the determined business scenario of the client 110 to generate the second recognition result of the voice data.
  • (1) The server 120 extracts the entity to be determined from the first recognition result. In the above example, word segmentation of "打电话给志文" yields two entities: "打电话" ("make a call") and "志文". Since "打电话" is a relatively definite action, it is no longer treated as an entity to be determined; in this example, "志文" is taken as the entity to be determined.
  • The server 120 may obtain multiple entities from the first recognition result through word segmentation or the like, and then extract one or more entities to be determined from them; the embodiments of the present invention do not limit this.
  • (2) The server 120 obtains at least one candidate entity from the client 110 according to the business scenario. As in the above example, when the client's business scenario is determined to be a call scenario, the server 120 obtains the contact list on the client 110 and uses the contact names in the list as candidate entities.
  • It should be noted that the server 120 may also acquire each entity in the currently displayed page of the client 110 as a candidate entity, or acquire items such as a song list, various application lists, or memos as candidate entities.
  • the selection of the candidate entity by the server 120 depends on the currently analyzed business scenario, and the embodiments of the present invention are not limited thereto.
  • (3) An entity is matched to the entity to be determined from the at least one candidate entity. According to one embodiment, the similarity value between each candidate entity and the entity to be determined is calculated separately, and the candidate entity with the largest similarity value is selected as the matched entity. It should be noted that any similarity calculation method can be combined with the embodiments of the present invention to achieve the optimized voice data recognition scheme.
  • (4) The second recognition result is generated according to the matched entity. Continuing the above example, the server 120 calculates the similarity value between each candidate entity in the contact list and the entity to be determined, "志文" (Zhiwen), determines that the entity with the highest similarity value is "之魂" (Zhihun), replaces "志文" with "之魂", and obtains the second recognition result: "打电话给之魂" ("call 之魂").
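  • The following sketch illustrates one possible implementation of steps (3) and (4): every candidate entity is scored against the entity to be determined and the highest-scoring one is substituted into the result. A pinyin-based string ratio (pypinyin plus difflib) is used here as the similarity measure, which is only one choice among many; the patent allows any similarity calculation method.

```python
from difflib import SequenceMatcher
from pypinyin import lazy_pinyin  # pinyin conversion; any other similarity basis would also do

def pinyin_similarity(a: str, b: str) -> float:
    """String similarity of the two entities' pinyin renderings, in [0, 1]."""
    return SequenceMatcher(None, "".join(lazy_pinyin(a)), "".join(lazy_pinyin(b))).ratio()

def match_entity(to_determine, candidates):
    """Pick the candidate entity with the largest similarity value."""
    return max(candidates, key=lambda c: pinyin_similarity(c, to_determine))

contacts = ["之魂", "张三", "李四"]                        # candidate entities from the contact list
matched = match_entity("志文", contacts)                   # -> "之魂" in the example above
second_result = "打电话给志文".replace("志文", matched)    # -> "打电话给之魂"
```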
  • In step S360, the server 120 obtains a representation of the user's intention based on the generated second recognition result and generates an instruction response; the server 120 then outputs the instruction response to the client 110 to instruct the client 110 to perform the corresponding operation.
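  • As a rough illustration of step S360, the sketch below turns an optimized recognition result into an instruction response for the client. The intent rules and the response format are assumptions for illustration; in practice the server would apply an NLU model here.

```python
def build_instruction_response(second_result: str) -> dict:
    """Map the optimized recognition result to a command the client can execute."""
    if second_result.startswith("打电话给"):
        return {"action": "dial", "contact": second_result[len("打电话给"):]}
    if second_result.startswith("我想听"):
        return {"action": "play_song", "title": second_result[len("我想听"):]}
    return {"action": "answer", "text": second_result}

print(build_instruction_response("打电话给之魂"))  # -> {'action': 'dial', 'contact': '之魂'}
```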
  • In actual application scenarios, because of the existence of polyphonic characters, homophones, and near-homophones, traditional voice data recognition schemes cannot distinguish the words entered by the user very well; as a result, the client cannot accurately understand the user's intention, which affects the user experience.
  • According to the embodiments of the present invention, the scene information on the client 110 is transmitted to the server 120 as additional data, so that the server 120 adds the constraint of the scene information when recognizing the voice data, and thereby obtains a recognition result closer to the user's intention.
  • FIG. 4 shows a schematic diagram of the display interface on the client 110 according to an embodiment of the present invention.
  • According to the embodiments of the present invention, the text data on the display interface of the client 110 is uploaded to the server 120 as scene information, so the user can directly say "热映电影" ("hot movies"). The server 120 recognizes the user's voice input based on this scene data (see the descriptions of step S340 and step S350), accurately obtains the representation of the user's intention, namely that the user wants to watch hot movies, converts it into a "click the hot-movies item" instruction response, and returns it to the client 110, which then performs the click operation on the display interface. A minimal sketch of this flow follows.
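  • The sketch below illustrates this "what you see is what you can say" flow for the FIG. 4 example: the text items on the display page, uploaded as scene information, serve directly as candidate entities, and the matched item becomes a click instruction. The function and field names are assumptions.

```python
def click_from_utterance(utterance: str, display_text: list) -> dict:
    """Match the utterance against the on-screen entries and turn the hit into a click command."""
    for item in display_text:
        if item in utterance or utterance in item:      # containment match; a similarity
            return {"action": "click", "target": item}  # measure could be used instead
    return {"action": "none"}

print(click_from_utterance("热映电影", ["精选", "热播大剧", "热映电影", "综艺", "动漫"]))
# -> {'action': 'click', 'target': '热映电影'}
```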
  • In still other voice interaction embodiments, suppose the user inputs the voice data "我想听遇见" ("I want to listen to 遇见") on the client 110. The server 120 first performs a series of recognition steps such as speech-to-text conversion and word segmentation on the voice data to obtain the first recognition result; combined with the scene information of the client 110, it analyzes that a music-playing application is in use on the client 110, i.e., the current business scenario of the client 110 is probably a song-listening scenario. The server 120 then obtains the song list of the account linked on the client 110 (or directly obtains the song list on the client's current display interface) and, through analysis, obtains the second recognition result, "我想听遇见", in which "遇见" is correctly interpreted as a song title.
  • Based on the above description, the recognition method 300 of the present invention can provide the user with a "what you see is what you can say" voice interaction experience. That is, whatever the user sees on the client can be selected directly by voice input, which greatly simplifies the user's input operations and improves the user's interactive experience.
  • According to the voice data recognition method 300 of the present invention, when uploading the voice data input by the user to the server 120, the client 110 also uploads the scene information on the client 110 (such as the client's foreground service, the text in its display interface, etc.) to the server 120 as additional data. That is, the client 110 provides additional business data to the server 120 to optimize the recognition result.
  • the recognition of voice data is closely combined with the current state (or business scenario) of the client 110, which can significantly improve the accuracy of recognition.
  • the method 300 according to the present invention can also significantly improve the accuracy of subsequent natural language processing to accurately perceive the user's intention.
  • The execution of the method 300 involves the various components in the system 100, with the server 120 as the focus of execution. For this reason, FIG. 5 shows a flowchart of a voice data recognition method 500 according to another embodiment of the present invention.
  • the method 500 shown in FIG. 5 is suitable for execution in the server 120 and is a further description of the method shown in FIG. 3.
  • the method 500 starts at step S510, and the server 120 acquires voice data and scene information of the client 110.
  • both voice data and scene information may be obtained from the client 110.
  • The scene information of the client may be information about the processes in use on the client, text information in the display interface of the client, the personal data of the user associated with the client (such as user information and user preferences), or environmental information about the client's location (such as local weather and local time); the embodiments of the present invention are not limited thereto.
  • the scene information of the client includes at least one or more of the following data: the client's process data, the client's application list, the application usage history data on the client, the user's personal data associated with the client, Data obtained from the conversation history, data obtained from at least one sensor of the client, text data in the display page of the client, and input data provided by the user in advance.
  • Of course, before the voice data is acquired, the method also includes the process of switching the client 110 (specifically, the voice interaction module on the client 110) from the sleep state to the working state according to the voice data input by the user; for details, reference may be made to the descriptions of steps S310 and S320 above.
  • Subsequently, in step S520, the server 120 recognizes the voice data and generates the first recognition result of the voice data.
  • The server 120 may recognize the voice data through, for example, a method based on a vocal tract model and speech knowledge, a template matching method, or a method using a neural network, so as to generate the first recognition result; the embodiments of the present invention do not impose many restrictions on this.
  • Then, in step S530, the server 120 recognizes (or optimizes) the first recognition result according to the scene information and generates the second recognition result of the voice data.
  • According to one embodiment, the server 120 first determines the current business scenario of the client based on the first recognition result and the scene information, and then recognizes the first recognition result according to the determined business scenario to generate the second recognition result of the voice data.
  • the server 120 obtains a representation of the user's intention based on the generated second recognition result, and generates an instruction response, and then outputs the instruction response to the client 110 to instruct the client 110 to perform the corresponding operation.
  • the server 120 can perceive the user's intent in the current business scenario through any NLP algorithm, and the present invention does not limit this too much.
  • For a detailed description of each step in the method 500, reference may be made to the relevant steps in the method 300 above (such as steps S340 and S350); for brevity, they are not repeated here.
  • FIG. 6 shows a schematic diagram of a voice data recognition device 600 residing in the server 120 according to an embodiment of the present invention.
  • the identification device 600 includes at least a connection management unit 610, a first processing unit 620 and a second processing unit 630.
  • the connection management unit 610 is used to implement various input/output operations of the recognition device 600, for example, to acquire voice data from the client 110 and scene information of the client 110.
  • the scene information of the client can be any information that can be obtained through the client, such as information about the process in use on the client, text information in the display interface on the client, and so on.
  • the scene information of the client includes at least one or more of the following data: the client's process data, the client's application list, the application usage history data on the client, the user's personal data associated with the client, Data obtained from the conversation history, data obtained from at least one sensor of the client, text data in the display page of the client, and input data provided by the user in advance.
  • the first processing unit 620 recognizes the voice data and generates a first recognition result of the voice data.
  • the second processing unit 630 recognizes the first recognition result according to the scene information, and generates a second recognition result of the voice data.
  • the second processing unit 630 further includes a business scene determination module 632 and an enhanced processing module 634.
  • the business scene determination module 632 determines the current business scene of the client 110 based on the first recognition result and the scene information; the enhanced processing module 634 recognizes the first recognition result according to the business scene to generate a second recognition result of voice data.
  • the enhanced processing module 634 may further include: an entity acquisition module 6342, a matching module 6344, and a generation module 6346.
  • the entity acquisition module 6342 is used to extract the entity to be determined in the first recognition result, and acquire at least one candidate entity from the client 110 according to the business scenario.
  • the matching module 6344 is configured to match the entity to be determined to one entity from the at least one candidate entity.
  • the generating module 6346 is used to generate a second recognition result according to the matched entity.
  • the various technologies described herein may be implemented in combination with hardware or software, or a combination thereof.
  • The method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media such as a removable hard disk, USB flash drive, floppy disk, CD-ROM, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
  • In the case where the program code is executed on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • the memory is configured to store program code; the processor is configured to execute the method of the present invention according to the instructions in the program code stored in the memory.
  • readable media includes readable storage media and communication media.
  • the readable storage medium stores information such as computer readable instructions, data structures, program modules, or other data.
  • Communication media generally embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and includes any information delivery media. Combinations of any of the above are also included within the scope of readable media.
  • the algorithm and display are not inherently related to any particular computer, virtual system, or other devices.
  • Various general-purpose systems can also be used with examples of the present invention. From the above description, the structure required to construct such a system is obvious.
  • the present invention is not directed to any particular programming language. It should be understood that various programming languages can be used to implement the contents of the present invention described herein, and the above descriptions of specific languages are for disclosure of the best embodiments of the present invention.
  • The modules, units, or components of the devices in the examples disclosed herein may be arranged in the device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example.
  • the modules in the foregoing examples may be combined into one module or, in addition, may be divided into multiple sub-modules.
  • modules in the device in the embodiment can be adaptively changed and set in one or more devices different from the embodiment.
  • The modules or units or components in the embodiments may be combined into one module or unit or component, and in addition they may be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose.
  • a processor having the necessary instructions for implementing the method or method element forms a device for implementing the method or method element.
  • the elements of the device embodiments described herein are examples of devices that are used to implement the functions performed by the elements for the purpose of implementing the invention.

Abstract

A voice data recognition method, apparatus, and system, and a corresponding computing device. The voice data recognition method includes the steps of: acquiring voice data and scene information of a client (S510); recognizing the voice data to generate a first recognition result of the voice data (S520); and recognizing the first recognition result according to the scene information to generate a second recognition result of the voice data (S530).

Description

一种语音数据的识别方法、装置及系统
本申请要求2018年12月11日递交的申请号为201811512516.9、发明名称为“一种语音数据的识别方法、装置及系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及语音处理技术领域,尤其涉及一种语音数据的识别方法、装置及系统。
背景技术
过去十几年来,互联网在人们生活的各个领域不断深化,人们可以通过互联网方便地进行购物、社交、娱乐、理财等活动。同时,为提高用户体验,研究人员实现了很多交互方案,如文字输入、手势输入、语音输入等。其中,智能语音交互由于其操作的便捷性而成为新一代交互模式的研究热点。
当前,随着物联网及智能化的快速发展,市场上出现了一些智能语音设备,例如智能音箱、包含智能交互模块的各种智能电子设备(如移动设备、智能电视、智能冰箱等)。在一些使用场景中,智能语音设备可以通过语音识别技术来识别用户输入的语音数据,进而为用户提供个性化服务。在实际应用中,存在一些多音字、同音字和近音字,如“天下”、“甜虾”、“田霞”,传统的语音识别方案无法很好地区分这些词,这势必会影响用户的交互体验。
综上,保证语音数据识别的准确率是提高用户语音交互体验的一个非常重要的环节。
发明内容
为此,本发明提供了一种语音数据的识别方法、装置及系统,以力图解决或至少缓解上面存在的至少一个问题。
根据本发明的一个方面,提供了一种语音数据的识别方法,包括步骤:获取语音数据和客户端的场景信息;对语音数据进行识别,生成语音数据的第一识别结果;以及根据场景信息对第一识别结果进行识别,生成该语音数据的第二识别结果。
可选地,在根据本发明的方法中,根据场景信息对第一识别结果进行识别,生成语音数据的第二识别结果的步骤包括:基于第一识别结果和场景信息,确定客户端当前的业务场景;根据业务场景对第一识别结果进行识别,以生成该语音数据的第二识别结果。
可选地,在根据本发明的方法中,根据业务场景对第一识别结果进行识别,以生成语音数据的第二识别结果的步骤又包括:提取第一识别结果中的待确定实体;根据业务场景从客户端上获取至少一个候选实体;从至少一个候选实体中为待确定实体匹配到一个实体;以及根据所匹配到的实体生成第二识别结果。
可选地,根据本发明的方法还包括步骤:若语音数据中包含预定对象,则指示客户端进入工作状态。
可选地,根据本发明的方法还包括步骤:基于所生成的第二识别结果以得到用户意图的表示,并生成指令响应;输出该指令响应。
可选地,在根据本发明的方法中,场景信息包括下列信息中的一个或多个:客户端的进程数据、客户端的应用列表、客户端上应用使用历史数据、关联于该客户端的用户个人数据、从对话历史获得的数据、从客户端的至少一个传感器上获得的数据、客户端显示页面中的文本数据、由用户预先提供的输入数据。
可选地,在根据本发明的方法中,从至少一个候选实体中为待确定实体匹配到一个实体的步骤包括:分别计算至少一个候选实体与待确定实体的相似度值;以及选取相似度值最大的一个候选实体作为匹配到的实体。
根据本发明的另一个方面,提供了一种语音数据的识别方法,包括步骤:获取语音数据和客户端的场景信息;以及根据场景信息对语音数据进行识别,生成该语音数据的识别结果。
根据本发明的又一方面,提供了一种语音数据识别装置,包括:连接管理单元,适于获取来自客户端的语音数据和客户端的场景信息;第一处理单元,适于对语音数据进行识别,生成语音数据的第一识别结果;以及第二处理单元,适于根据场景信息对第一识别结果进行识别,生成该语音数据的第二识别结果。
可选地,在根据本发明的装置中,第二处理单元包括:业务场景确定模块,适于基于第一识别结果和场景信息,确定客户端当前的业务场景;增强处理模块,适于根据业务场景对第一识别结果进行识别,以生成该语音数据的第二识别结果。
可选地,在根据本发明的装置中,增强处理模块包括:实体获取模块,适于提取第一识别结果中的待确定实体,还适于根据业务场景从客户端上获取至少一个候选实体;匹配模块,适于从至少一个候选实体中为待确定实体匹配到一个实体;生成模块,适于根据所匹配到的实体生成第二识别结果。
可选地,在根据本发明的装置中,场景信息包括下列信息中的一个或多个:客户端 的进程数据、客户端的应用列表、客户端上应用使用历史数据、关联于该客户端的用户个人数据、从对话历史获得的数据、从客户端的至少一个传感器上获得的数据、客户端显示页面中的文本数据、由用户预先提供的输入数据。
根据本发明的再一个方面,提供了一种语音数据的识别系统,包括:客户端,适于接收用户的语音数据并传送给语音数据的识别装置;以及服务器,包括如上所述的语音数据识别装置,适于对来自客户端的语音数据进行识别,以生成相应的第二识别结果。
可选地,在根据本发明的系统中,语音数据识别装置还适于基于所生成的第二识别结果以得到用户意图的表示,并生成指令响应,还适于输出指令响应给客户端;以及客户端还适于根据指令响应执行相应操作。
可选地,在根据本发明的系统中,语音数据识别装置还适于在来自客户端的语音数据中包含预定对象时,指示客户端进入工作状态。
根据本发明的再一个方面,提供了一种智能音箱,包括:接口单元,适于获取用户输入的语音数据;交互单元,适于响应于用户输入语音数据,获取当前的场景信息,还适于获取根据场景信息对语音数据进行识别后生成的指令响应,并基于该指令响应,执行相应的操作。
根据本发明的再一个方面,提供了一种计算设备,包括:至少一个处理器;和存储有程序指令的存储器,其中,程序指令被配置为适于由至少一个处理器执行,程序指令包括用于执行如上所述任一方法的指令。
根据本发明的再一个方面,提供了一种存储有程序指令的可读存储介质,当程序指令被计算设备读取并执行时,使得计算设备执行如上所述的任一方法。
根据本发明的语音数据的识别方法,客户端在将用户输入的语音数据上传给服务器进行识别的同时,还会将客户端上的场景信息作为附加数据一并上传至服务器。这些场景信息表征了客户端当前的状态。服务器在对语音数据进行初步识别后,还会基于场景信息对初步识别后的文本进行优化,最终得到识别后的文本。这样,将对语音数据的识别与客户端的当前状态紧密结合,能够显著提升识别的准确率。
上述说明仅是本发明技术方案的概述,为了能够更清楚了解本发明的技术手段,而可依照说明书的内容予以实施,并且为了让本发明的上述和其它目的、特征和优点能够更明显易懂,以下特举本发明的具体实施方式。
附图说明
为了实现上述以及相关目的,本文结合下面的描述和附图来描述某些说明性方面,这些方面指示了可以实践本文所公开的原理的各种方式,并且所有方面及其等效方面旨在落入所要求保护的主题的范围内。通过结合附图阅读下面的详细描述,本公开的上述以及其它目的、特征和优势将变得更加明显。遍及本公开,相同的附图标记通常指代相同的部件或元素。
图1示出了根据本发明一个实施例的语音数据的识别系统100的场景示意图;
图2示出了根据本发明一个实施例的计算设备200的示意图;
图3示出了根据本发明一个实施例的语音数据的识别方法300的交互流程图;
图4示出了根据本发明一个实施例的客户端的显示界面示意图;
图5示出了根据本发明另一个实施例的语音数据的识别方法500的流程示意图;以及
图6示出了根据本发明一个实施例的语音数据的识别装置600的示意图。
具体实施方式
下面将参照附图更详细地描述本公开的示例性实施例。虽然附图中显示了本公开的示例性实施例,然而应当理解,可以以各种形式实现本公开而不应被这里阐述的实施例所限制。相反,提供这些实施例是为了能够更透彻地理解本公开,并且能够将本公开的范围完整的传达给本领域的技术人员。
图1示出了根据本发明一个实施例的语音数据的识别系统100的场景示意图。如图1所示,系统100中包括客户端110和服务器120。应当指出,图1所示的系统100仅作为一个示例,本领域技术人员可以理解,在实际应用中,系统100通常包括多个客户端110和服务器120,本发明对系统100中所包括的客户端110和服务器120的数量不做限制。
客户端110为具有语音交互模块的设备,其可以接收用户发出的语音指示,以及向用户返回语音或非语音信息。一个典型的语音交互模块包括麦克风等语音输入单元、扬声器等语音输出单元以及处理器。语音交互模块可以内置在客户端110中,也可以作为一个独立的模块与客户端110配合使用(例如经由API或通过其它方式与客户端110进行通信,调用客户端110上的功能或应用接口的服务),本发明的实施例对此不做限制。客户端110例如可以是具有语音交互模块的移动设备(如智能音箱)、智能机器人、智能家电(包括智能电视、智能冰箱、智能微波炉等)等,但不限于此。客户端110的一 个应用场景为家用场景,即,客户端110放置于用户家中,用户可以向客户端110发出语音指示以实现某些功能,例如上网、点播歌曲、购物、了解天气预报、对家中的其他智能家居设备进行控制,等等。
服务器120与客户端110通过网络进行通信,其例如可以是物理上位于一个或多个地点的云服务器。服务器120中包含语音数据识别装置600,用于为客户端110上接收的语音数据提供识别服务,以得到用户输入的语音数据的文本表示,以及,在基于文本表示得到用户意图的表示后,生成指令响应并返回给客户端110。
根据本发明的实施方式,客户端110接收用户输入的语音数据,并连同客户端上的场景信息一并传送给服务器120。应当指出,客户端110也可以在接收到用户输入的语音数据时,上报给服务器120,由服务器120向客户端110拉取相应的语音数据、场景信息。本发明的实施例对此不做过多限制。服务器120与客户端110相配合,根据场景信息对该语音数据进行识别,以生成相应的识别结果。服务器120还可以通过识别结果理解用户的意图,并生成相应的指令响应给客户端110,由客户端110根据该指令响应执行相应的操作,来为用户提供相应的服务,例如设置闹钟、拨打电话、发送邮件、播报资讯、播放歌曲、视频等。当然,客户端110也可以根据指令响应输出相应的语音响应给用户,本发明的实施例对此不做限制。
客户端的场景信息例如是用户正在操作客户端上的某个应用或者类似软件的状态。例如,用户可能正在使用某个应用播放视频流数据,又如,用户正在使用某个社交软件与特定个人进行交流。当客户端110接收到用户输入的语音数据时,客户端110将上述场景信息传送给服务器120,以便服务器120基于场景信息分析用户输入的语音数据,以准确地感知用户意图。
以下以客户端110被实现为智能音箱为例,概括说明根据本发明实施例的语音数据的识别方案。
除基本的配置外,根据本发明一个实施例的智能音箱还包括:接口单元和交互单元。其中,接口单元获取用户输入的语音数据;交互单元响应于用户输入语音数据,获取智能音箱当前的场景信息,而后,交互单元还能够获取根据该场景信息对该语音数据进行识别后生成的指令响应,并基于指令响应,执行相应的操作。
在一些实施例中,接口单元可以将获取的语音数据和当前的场景信息一并传送给服务器120,以便服务器120根据该场景信息对语音数据进行识别,生成该语音数据的识别结果,同时,服务器120还会基于识别结果生成指令响应,返回给智能音箱(关于服 务器120的上述执行过程,可参考下文中图3的相关描述内容,此处不做展开)。智能音箱基于指令响应,执行相应的操作并输出给用户。更具体的执行流程可以参考图1、图3中的相关描述内容,本发明的实施例对此不做过多限制。
应当指出,在根据本发明的另一些实施方式中,服务器120也可以实现为通过网络与客户端110相连的其他电子设备(如,同处于一个物联网环境中的其他计算设备)。甚至,当客户端110(如,智能音箱)具有足够的存储空间和算力的条件下,服务器120也可以实现为客户端110本身。
根据本发明的实施方式,客户端110和服务器120均可以通过如下所述的计算设备200来实现。图2示出了根据本发明一个实施例的计算设备200的示意图。
如图2所示,在基本的配置202中,计算设备200典型地包括系统存储器206和一个或者多个处理器204。存储器总线208可以用于在处理器204和系统存储器206之间的通信。
取决于期望的配置,处理器204可以是任何类型的处理,包括但不限于:微处理器(μP)、微控制器(μC)、数字信息处理器(DSP)或者它们的任何组合。处理器204可以包括诸如一级高速缓存210和二级高速缓存212之类的一个或者多个级别的高速缓存、处理器核心214和寄存器216。示例的处理器核心214可以包括运算逻辑单元(ALU)、浮点数单元(FPU)、数字信号处理核心(DSP核心)或者它们的任何组合。示例的存储器控制器218可以与处理器204一起使用,或者在一些实现中,存储器控制器218可以是处理器204的一个内部部分。
取决于期望的配置,系统存储器206可以是任意类型的存储器,包括但不限于:易失性存储器(诸如RAM)、非易失性存储器(诸如ROM、闪存等)或者它们的任何组合。系统存储器206可以包括操作系统220、一个或者多个应用222以及程序数据224。在一些实施方式中,应用222可以布置为在操作系统上由一个或多个处理器204利用程序数据224执行指令。
计算设备200还可以包括有助于从各种接口设备(例如,输出设备242、外设接口244和通信设备246)到基本配置202经由总线/接口控制器230的通信的接口总线240。示例的输出设备242包括图形处理单元248和音频处理单元250。它们可以被配置为有助于经由一个或者多个A/V端口252与诸如显示器或者扬声器之类的各种外部设备进行通信。示例外设接口244可以包括串行接口控制器254和并行接口控制器256,它们可以被配置为有助于经由一个或者多个I/O端口258和诸如输入设备(例如,键盘、鼠标、 笔、语音输入设备、触摸输入设备)或者其他外设(例如打印机、扫描仪等)之类的外部设备进行通信。示例的通信设备246可以包括网络控制器260,其可以被布置为便于经由一个或者多个通信端口264与一个或者多个其他计算设备262通过网络通信链路的通信。
网络通信链路可以是通信介质的一个示例。通信介质通常可以体现为在诸如载波或者其他传输机制之类的调制数据信号中的计算机可读指令、数据结构、程序模块,并且可以包括任何信息递送介质。“调制数据信号”可以是这样的信号,它的数据集中的一个或者多个或者它的改变可以在信号中编码信息的方式进行。作为非限制性的示例,通信介质可以包括诸如有线网络或者专线网络之类的有线介质,以及诸如声音、射频(RF)、微波、红外(IR)或者其它无线介质在内的各种无线介质。这里使用的术语计算机可读介质可以包括存储介质和通信介质二者。
计算设备200可以实现为服务器,例如文件服务器、数据库服务器、应用程序服务器和WEB服务器等,也可以实现为包括桌面计算机和笔记本计算机配置的个人计算机。当然,计算设备200也可以实现为小尺寸便携(或者移动)电子设备的一部分。在根据本发明的实施例中,计算设备200被配置为执行根据本发明的语音数据的识别方法。计算设备200的应用222中包含执行根据本发明的方法300的多条程序指令。
图3示出了根据本发明一个实施例的语音数据的识别方法300的交互流程图。该识别方法300适于在上述系统100中执行。如图3所示,方法300始于步骤S310。
在步骤S310中,客户端110接收用户输入的各种语音数据,并检测其中是否包含预定对象(预定对象例如是预定唤醒词),若包含预定对象则将其传送至服务器120。
在一个实施例中,客户端110中,语音交互模块的麦克风持续接收外部声音,当用户要使用客户端110进行语音交互时,需要先说出相应的唤醒词来唤醒客户端110。应当理解,在一些场景下,客户端110一直处于工作状态,用户需要通过输入唤醒词来唤醒客户端110中的语音交互模块,为便于说明,在本发明的实施例中统一记作:“唤醒客户端110”。
需要说明的是,唤醒词可以在客户端110出厂时预先设置,也可以由用户在使用客户端110的过程中自行设置,本发明对唤醒词的长短、内容均不做限制。例如,唤醒词可以被设置为“小精灵”,“你好,小精灵”,等等。
客户端110可以直接将预定对象传送至服务器120,也可以将包含预定对象的语音数据传送至服务器120,以告知服务器120,客户端110将被唤醒。随后在步骤S320中, 服务器120在接收到来自客户端110的通知后,确认用户要使用客户端110进行语音交互,服务器120执行相应的唤醒处理,并指示客户端110进入工作状态。
在一种实施例中,服务器120返回给客户端110的指示中包含文本数据,例如,服务器120返回的文本数据为“你好,请讲”,客户端110在接收到指示后通过TextToSpeech(TTS,从文本到语音)技术将文本数据转换为语音数据,并通过语音交互模块播放出来,以告知用户,客户端110已被唤醒,可以开始语音交互。
在客户端110被唤醒的状态下,在随后的步骤S330中,客户端110接收用户输入的语音数据,并将其转发至服务器120。
根据本发明的实施方式,为优化语音数据的识别过程,客户端110在接收到用户输入的语音数据时,还会收集客户端110的场景信息一并转发至服务器120。客户端的场景信息可以包括任意可以得到的客户端上的信息,在一些实施例中,客户端的场景信息包括下列信息中的一个或多个:客户端的进程数据、客户端的应用列表、客户端上应用使用历史数据、关联于该客户端的用户个人数据、从对话历史获得的数据、从客户端的至少一个传感器(如光线传感器、距离传感器、重力传感器、加速度传感器、GPS位置传感器、温湿度传感器等等)上获得的数据、客户端显示页面中的文本数据、由用户预先提供的输入数据,但不限于此。
随后,服务器120对语音数据进行识别,以得到该语音数据的识别结果(在一种优选的实施例中,识别结果采用文本表示,但不限于此),并基于此分析出用户意图。根据本发明的实施方式,服务器120分两步完成优化的识别过程,以下分为步骤S340和步骤S350来进行阐述。
在步骤S340中,服务器120对语音数据进行初步的识别,生成该语音数据的第一识别结果。
通常,服务器120通过ASR(Automatic Speech Recognition)技术对语音数据进行识别。服务器120可以先将语音数据表示为文本数据,再对文本数据进行分词处理,匹配得到第一识别结果。典型的语音识别方法例如可以是:基于声道模型和语音知识的方法、模板匹配的方法以及利用神经网络的方法等,本发明的实施例对采用何种ASR技术进行语音识别并不做过多限制,任何已知的或未来可知的语音识别算法均可以与本发明的实施例相结合,以实现本发明的方法300。
需要说明的是,服务器120在通过ASR技术进行识别时,还可以包括对语音数据的一些预处理操作,如:采样、量化、去除不包含语音内容的语音数据(如,静默的语音 数据)、对语音数据进行分帧、加窗等处理,等等。本发明的实施例在此处不做过多展开。
随后在步骤S350中,服务器120再根据场景信息对第一识别结果进行识别,生成该语音数据的第二识别结果。
根据本发明的实施方式,步骤S350又可以分两步来执行。
第一步,基于经步骤S340生成的第一识别结果和客户端110的场景信息,确定客户端110当前的业务场景。客户端110的业务场景表征的是客户端110当前或根据用户输入分析出的即将处于的业务场景。业务场景例如可以包含通话场景、短消息场景、听歌场景、播放视频场景、浏览网页场景等等。
假设用户通过客户端110输入的语音数据为——“打电话给之魂”,服务器120经初步的语音识别后,得到的第一识别结果为——“打电话给志文”。此时,服务器120从客户端110的场景信息中分析出客户端110可能处于通话的业务场景中(例如,客户端110上已经打开的进程中有“拨号键盘”),即判断客户端当前的业务场景为通话场景。或者,服务器120对第一识别结果进行分词处理后得到表征用户动作的关键词“打电话”,服务器120将关键词与客户端110的场景信息结合,分析得出当前的业务场景为通话场景。
第二步,服务器120根据所确定的客户端110的业务场景对第一识别结果进行进一步地识别,以生成该语音数据的第二识别结果。
(1)服务器120提取第一识别结果中的待确定实体,如上例中的“打电话给志文”,服务器120经分词处理后得到两个实体:“打电话”和“志文”。通常,“打电话”是一个较为确定的动作,不再作为待确定实体。故,在本例中,以“志文”作为待确定实体。服务器120可以通过分词等方式从第一识别结果中得到多个实体,而后从中提取出一个或多个待确定实体,本发明的实施例对此不做限制。
(2)服务器120根据业务场景从客户端110上获取至少一个候选实体。如上例,当确定客户端的业务场景为通话场景时,服务器120获取客户端110上的联系人列表,将列表中的联系人名称作为候选实体。应当指出,服务器120还可以获取客户端110上当前显示页面中的各实体作为候选实体。也可以获取诸如歌曲列表、各种应用列表、备忘录等作为候选实体。服务器120对候选实体的选择取决于当前所分析出的业务场景,本发明的实施例不限于此。
(3)从至少一个候选实体中为待确定实体匹配到一个实体。根据一种实施例,分别 计算各候选实体与待确定实体的相似度值,并选取相似度值最大的一个候选实体作为匹配到的实体。应当指出,任何相似度计算方法均可以与本发明的实施例相结合,以实现语音数据识别的优化方案。
(4)根据所匹配到的实体生成第二识别结果。根据一种实施例,用匹配到的实体替换第一识别结果中的待确定实体后,得到的文本就是最终的第二识别结果。还是以上例为例,服务器120计算联系人列表中各候选实体与待确定实体——“志文”之间的相似度值,最终确定出相似度值最高的实体——“之魂”,并用“之魂”替代“志文”,得到第二识别结果——“打电话给之魂”。
随后在步骤S360中,服务器120基于所生成的第二识别结果得到用户意图的表示,并生成指令响应,随后,服务器120输出该指令响应给客户端110,以指示客户端110执行相应操作。
在实际的应用场景中,由于多音字、同音字和近音字的存在,传统的语音数据识别方案无法很好地区分用户输入的词语,这样,客户端不能准确理解用户意图,进而影响用户的使用体验。根据本发明的实施方式,将客户端110上的场景信息作为附加数据一并传送给服务器120,以便服务器120在识别语音数据时,加入场景信息的约束,以得到更贴近用户意图的识别结果。
在另一些语音交互的实施例中,通常需要通过输入“下标”的方式来实现交互。如图4示出了根据本发明一个实施例的客户端110上的显示界面示意图。
可以将图4看作是一个视频网站的显示界面,在客户端110上呈现了多种与视频相关的应用图标(如精选、热播大剧、热映电影、综艺、动漫、体育、纪录片等),用户可以通过输入某个应用图标对应的词条来选中该应用图标,以实现“用户点击”的操作目的。但是,由于各应用图标对应的词条较短(多数只有一两个字),在语素较少的情况下,极大可能造成ASR识别率低,导致无法准确理解用户意图。因此,在现有的交互方案中,通常会为每个应用图标分配一个下标(如图4中所示,“精选”对应下标“1”、“热播大剧”对应下标“2”),由用户输入语音——“我选择第几个”,来进行交互。
但是当界面中应用图标很多、或者应用图标的布局不是很规则时,通过输入“下标”的方式来进行语音交互就不太方便,一方面增加了用户的学习负担,另一方面还可能误解用户意图,带来不够友好的用户体验。在根据本发明的实施方式中,将客户端110的显示界面上的文本数据作为场景信息上传给服务器120,这样,用户就可以直接输入——“热映电影”,服务器120基于场景数据对用户输入的语音进行识别(参见步骤S340 和步骤S350的相关描述),能够准确地得到用户意图的表示——“用户要观看热映电影”,并转化成“用户点击热映电影”的指令响应给客户端110,由客户端110在显示界面上实现该点击操作。
在再一些语音交互的实施例中,例如,用户在客户端110上输入语音数据——“我想听遇见”,服务器120先对语音数据进行语音转文本、分词等一系列识别后得到第一识别结果;再结合客户端110上的场景信息分析出客户端110上正在使用音乐播放应用,即,客户端110当前的业务场景可能是听歌场景;此时,服务器120获取客户端110上相关联账号的歌曲列表(或者说,服务器120直接就获取到客户端110当前显示界面上的歌曲列表),通过分析得出第二识别结果——“我想听遇见”。
基于上述描述,通过本发明的识别方法300,可以为用户提供“所见即可说”的语音交互体验。即,用户从客户端上看到什么,就可以直接通过输入语音的方式来进行选择,极大地简化用户的输入操作,提高用户的交互体验。
根据本发明的对语音数据的识别方法300,客户端110在将用户输入的语音数据上传给服务器120时,将客户端110上的场景信息(如客户端110的前台业务、显示界面中的文本等)作为附加数据一并上传至服务器120。即,客户端110提供额外的业务数据给服务器120,以优化识别的结果。这样,将对语音数据的识别与客户端110的当前状态(或业务场景)紧密结合,能够显著提升识别的准确率。在整个语音交互的处理过程中,根据本发明的方法300还能够显著提升后续的自然语言处理的准确率,以准确感知用户的意图。
方法300的执行涉及到系统100中的各个部件,其中,服务器120作为执行的重点,为此,在图5中示出了根据本发明另一个实施例的语音数据的识别方法500的流程示意图。图5所示的方法500适于在服务器120中执行,是图3所示方法的进一步说明。
在图5中,方法500始于步骤S510,服务器120获取语音数据和客户端110的场景信息。在根据本发明的一些实施例中,语音数据和场景信息均可以是从客户端110上获取的。客户端的场景信息可以是客户端上的正在使用进程的相关信息、也可以是客户端上显示界面中的文本信息、还可以是关联在客户端上的用户的个人数据(如用户信息、用户偏好等)、也可以是客户端所处位置的环境信息(如本地天气、本地时间等),本发明的实施例不限于此。在一种实施例中,客户端的场景信息至少包括以下数据中的一种或多种:客户端的进程数据、客户端的应用列表、客户端上应用使用历史数据、关联于该客户端的用户个人数据、从对话历史获得的数据、从客户端的至少一个传感器上获 得的数据、客户端显示页面中的文本数据、由用户预先提供的输入数据。
当然,在获取语音数据之前,还包括根据用户输入的语音数据将客户端110(具体来说,是客户端110上的语音交互模块)由休眠状态切换到工作状态的过程。具体可参考上文中关于步骤S310和步骤S320的相关描述。
随后在步骤S520中,服务器120对语音数据进行识别,生成该语音数据的第一识别结果。服务器120可以通过例如基于声道模型和语音知识的方法、模板匹配的方法以及利用神经网络的方法等来实现对该语音数据的识别,以生成第一识别结果,本发明的实施例对此不做过多限制。
随后在步骤S530中,服务器120根据场景信息对第一识别结果进行识别(或者说是优化处理),生成该语音数据的第二识别结果。根据一种实施例,服务器120先基于第一识别结果和场景信息,确定客户端当前的业务场景;再根据所确定的业务场景对第一识别结果进行识别,以生成该语音数据的第二识别结果。
最后,服务器120基于所生成的第二识别结果得到用户意图的表示,并生成指令响应,随后,输出该指令响应给客户端110,以指示客户端110执行相应操作。服务器120可以通过任何NLP算法来感知当前业务场景下的用户意图,本发明对此不做过多限制。
关于方法500中各步骤的具体描述可参考前文方法300中的相关步骤(如步骤S340、S350等),篇幅所限,此处不再进行赘述。
为配合图3~图5的相关描述进一步说明服务器120,图6示出了根据本发明一个实施例的驻留在服务器120中的语音数据识别装置600的示意图。
如图6所示,识别装置600至少包括:连接管理单元610、第一处理单元620和第二处理单元630。
连接管理单元610用于实现识别装置600的各种输入/输出操作,例如,获取来自客户端110的语音数据和客户端110的场景信息。如前文所述,客户端的场景信息可以是通过客户端可获取的任何信息,如客户端上的正在使用进程的相关信息、客户端上显示界面中的文本信息,等等。在一种实施例中,客户端的场景信息至少包括以下数据中的一种或多种:客户端的进程数据、客户端的应用列表、客户端上应用使用历史数据、关联于该客户端的用户个人数据、从对话历史获得的数据、从客户端的至少一个传感器上获得的数据、客户端显示页面中的文本数据、由用户预先提供的输入数据。
第一处理单元620对语音数据进行识别,生成该语音数据的第一识别结果。第二处理单元630根据场景信息对第一识别结果进行识别,生成该语音数据的第二识别结果。
根据一种实施例,第二处理单元630又包括业务场景确定模块632和增强处理模块634。其中,业务场景确定模块632基于第一识别结果和场景信息,确定客户端110当前的业务场景;增强处理模块634根据业务场景对第一识别结果进行识别,以生成语音数据的第二识别结果。
进一步地,增强处理模块634又可以包括:实体获取模块6342、匹配模块6344和生成模块6346。实体获取模块6342用于提取出第一识别结果中的待确定实体,以及根据业务场景从客户端110上获取至少一个候选实体。匹配模块6344用于从至少一个候选实体中为待确定实体匹配到一个实体。生成模块6346用于根据所匹配到的实体生成第二识别结果。
关于识别装置600中各部分所执行操作的具体描述可参见前文关于图1、图3的相关内容,此处不再赘述。
这里描述的各种技术可结合硬件或软件,或者它们的组合一起实现。从而,本发明的方法和设备,或者本发明的方法和设备的某些方面或部分可采取嵌入有形媒介,例如可移动硬盘、U盘、软盘、CD-ROM或者其它任意机器可读的存储介质中的程序代码(即指令)的形式,其中当程序被载入诸如计算机之类的机器,并被所述机器执行时,所述机器变成实践本发明的设备。
在程序代码在可编程计算机上执行的情况下,计算设备一般包括处理器、处理器可读的存储介质(包括易失性和非易失性存储器和/或存储元件),至少一个输入装置,和至少一个输出装置。其中,存储器被配置用于存储程序代码;处理器被配置用于根据该存储器中存储的所述程序代码中的指令,执行本发明的方法。
以示例而非限制的方式,可读介质包括可读存储介质和通信介质。可读存储介质存储诸如计算机可读指令、数据结构、程序模块或其它数据等信息。通信介质一般以诸如载波或其它传输机制等已调制数据信号来体现计算机可读指令、数据结构、程序模块或其它数据,并且包括任何信息传递介质。以上的任一种的组合也包括在可读介质的范围之内。
在此处所提供的说明书中,算法和显示不与任何特定计算机、虚拟系统或者其它设备固有相关。各种通用系统也可以与本发明的示例一起使用。根据上面的描述,构造这类系统所要求的结构是显而易见的。此外,本发明也不针对任何特定编程语言。应当明白,可以利用各种编程语言实现在此描述的本发明的内容,并且上面对特定语言所做的描述是为了披露本发明的最佳实施方式。
在此处所提供的说明书中,说明了大量具体细节。然而,能够理解,本发明的实施例可以在没有这些具体细节的情况下被实践。在一些实例中,并未详细示出公知的方法、结构和技术,以便不模糊对本说明书的理解。
类似地,应当理解,为了精简本公开并帮助理解各个发明方面中的一个或多个,在上面对本发明的示例性实施例的描述中,本发明的各个特征有时被一起分组到单个实施例、图、或者对其的描述中。然而,并不应将该公开的方法解释成反映如下意图:即所要求保护的本发明要求比在每个权利要求中所明确记载的特征更多特征。更确切地说,如下面的权利要求书所反映的那样,发明方面在于少于前面公开的单个实施例的所有特征。因此,遵循具体实施方式的权利要求书由此明确地并入该具体实施方式,其中每个权利要求本身都作为本发明的单独实施例。
本领域那些技术人员应当理解在本文所公开的示例中的设备的模块或单元或组件可以布置在如该实施例中所描述的设备中,或者可替换地可以定位在与该示例中的设备不同的一个或多个设备中。前述示例中的模块可以组合为一个模块或者此外可以分成多个子模块。
本领域那些技术人员可以理解,可以对实施例中的设备中的模块进行自适应性地改变并且把它们设置在与该实施例不同的一个或多个设备中。可以把实施例中的模块或单元或组件组合成一个模块或单元或组件,以及此外可以把它们分成多个子模块或子单元或子组件。除了这样的特征和/或过程或者单元中的至少一些是相互排斥之外,可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者设备的所有过程或单元进行组合。除非另外明确陈述,本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的的替代特征来代替。
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as methods, or as combinations of method elements, that may be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor having the necessary instructions for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the present invention has been described in terms of a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be envisaged within the scope of the invention thus described. Furthermore, it should be noted that the language used in this specification has been principally selected for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure made herein regarding the scope of the present invention is illustrative and not restrictive, the scope of the invention being defined by the appended claims.

Claims (22)

  1. A method for recognizing voice data, comprising the steps of:
    acquiring voice data and scene information of a client;
    recognizing the voice data to generate a first recognition result of the voice data; and
    recognizing the first recognition result according to the scene information to generate a second recognition result of the voice data.
  2. The method according to claim 1, wherein the step of recognizing the first recognition result according to the scene information to generate the second recognition result of the voice data comprises:
    determining a current business scenario of the client based on the first recognition result and the scene information;
    recognizing the first recognition result according to the business scenario to generate the second recognition result of the voice data.
  3. The method according to claim 2, wherein the step of recognizing the first recognition result according to the business scenario to generate the second recognition result of the voice data comprises:
    extracting an entity to be determined from the first recognition result;
    acquiring at least one candidate entity from the client according to the business scenario;
    matching the entity to be determined with one entity from the at least one candidate entity; and
    generating the second recognition result according to the matched entity.
  4. The method according to any one of claims 1 to 3, further comprising, before the step of acquiring the voice data and the scene information of the client, the step of:
    instructing the client to enter a working state if the voice data contains a predetermined object.
  5. The method according to any one of claims 1 to 3, further comprising, after generating the second recognition result of the voice data, the steps of:
    obtaining a representation of the user's intention based on the generated second recognition result, and generating an instruction response;
    outputting the instruction response.
  6. The method according to claim 4, further comprising, after generating the second recognition result of the voice data, the steps of:
    obtaining a representation of the user's intention based on the generated second recognition result, and generating an instruction response;
    outputting the instruction response.
  7. The method according to any one of claims 1 to 3, wherein the scene information includes one or more of the following: process data of the client, an application list of the client, application usage history data on the client, personal data of a user associated with the client, data obtained from a dialogue history, data obtained from at least one sensor of the client, text data in a display page of the client, and input data provided in advance by the user.
  8. The method according to claim 4, wherein the scene information includes one or more of the following: process data of the client, an application list of the client, application usage history data on the client, personal data of a user associated with the client, data obtained from a dialogue history, data obtained from at least one sensor of the client, text data in a display page of the client, and input data provided in advance by the user.
  9. The method according to claim 5, wherein the scene information includes one or more of the following: process data of the client, an application list of the client, application usage history data on the client, personal data of a user associated with the client, data obtained from a dialogue history, data obtained from at least one sensor of the client, text data in a display page of the client, and input data provided in advance by the user.
  10. The method according to claim 6, wherein the scene information includes one or more of the following: process data of the client, an application list of the client, application usage history data on the client, personal data of a user associated with the client, data obtained from a dialogue history, data obtained from at least one sensor of the client, text data in a display page of the client, and input data provided in advance by the user.
  11. The method according to claim 3, wherein the step of matching the entity to be determined with one entity from the at least one candidate entity comprises:
    calculating similarity values between the at least one candidate entity and the entity to be determined, respectively; and
    selecting the candidate entity with the largest similarity value as the matched entity.
  12. A method for recognizing voice data, comprising the steps of:
    acquiring voice data and scene information of a client; and
    recognizing the voice data according to the scene information to generate a recognition result of the voice data.
  13. A voice data recognition apparatus, comprising:
    a connection management unit adapted to acquire voice data and scene information of a client;
    a first processing unit adapted to recognize the voice data and generate a first recognition result of the voice data; and
    a second processing unit adapted to recognize the first recognition result according to the scene information and generate a second recognition result of the voice data.
  14. The apparatus according to claim 13, wherein the second processing unit comprises:
    a business scenario determination module adapted to determine a current business scenario of the client based on the first recognition result and the scene information;
    an enhancement processing module adapted to recognize the first recognition result according to the business scenario to generate the second recognition result of the voice data.
  15. The apparatus according to claim 14, wherein the enhancement processing module comprises:
    an entity acquisition module adapted to extract an entity to be determined from the first recognition result, and further adapted to acquire at least one candidate entity from the client according to the business scenario;
    a matching module adapted to match the entity to be determined with one entity from the at least one candidate entity;
    a generation module adapted to generate the second recognition result according to the matched entity.
  16. The apparatus according to any one of claims 13 to 15, wherein the scene information includes one or more of the following: process data of the client, an application list of the client, application usage history data on the client, personal data of a user associated with the client, data obtained from a dialogue history, data obtained from at least one sensor of the client, text data in a display page of the client, and input data provided in advance by the user.
  17. A system for recognizing voice data, comprising:
    a client adapted to receive voice data of a user and transmit the voice data to a voice data recognition apparatus; and
    a server comprising the voice data recognition apparatus according to any one of claims 13 to 15, adapted to recognize the voice data from the client to generate a corresponding second recognition result.
  18. The system according to claim 17, wherein:
    the voice data recognition apparatus is further adapted to obtain a representation of the user's intention based on the generated second recognition result, to generate an instruction response, and to output the instruction response to the client; and
    the client is further adapted to perform a corresponding operation according to the instruction response.
  19. The system according to claim 17 or 18, wherein:
    the voice data recognition apparatus is further adapted to instruct the client to enter a working state when the voice data from the client contains a predetermined object.
  20. A smart speaker, comprising:
    an interface unit adapted to acquire voice data input by a user;
    an interaction unit adapted to acquire current scene information in response to the user inputting the voice data, further adapted to acquire an instruction response generated after the voice data is recognized according to the scene information, and to perform a corresponding operation based on the instruction response.
  21. A computing device, comprising:
    at least one processor; and
    a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor and comprise instructions for performing the method according to any one of claims 1 to 12.
  22. A readable storage medium storing program instructions which, when read and executed by a computing device, cause the computing device to perform the method according to any one of claims 1 to 12.
PCT/CN2019/122933 2018-12-11 2019-12-04 一种语音数据的识别方法、装置及系统 WO2020119541A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811512516.9A CN111312233A (zh) 2018-12-11 2018-12-11 一种语音数据的识别方法、装置及系统
CN201811512516.9 2018-12-11

Publications (1)

Publication Number Publication Date
WO2020119541A1 true WO2020119541A1 (zh) 2020-06-18

Family

ID=71075329

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122933 WO2020119541A1 (zh) 2018-12-11 2019-12-04 一种语音数据的识别方法、装置及系统

Country Status (3)

Country Link
CN (1) CN111312233A (zh)
TW (1) TW202022849A (zh)
WO (1) WO2020119541A1 (zh)

Also Published As

Publication number Publication date
TW202022849A (zh) 2020-06-16
CN111312233A (zh) 2020-06-19


Legal Events

121 (EP): The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 19896983; country of ref document: EP; kind code of ref document: A1.
NENP: Non-entry into the national phase. Ref country code: DE.
122 (EP): PCT application non-entry in European phase. Ref document number: 19896983; country of ref document: EP; kind code of ref document: A1.