WO2019071607A1 - Voice information processing method, apparatus, and terminal - Google Patents

Voice information processing method, apparatus, and terminal

Info

Publication number
WO2019071607A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
text information
probability
terminal
domain
Prior art date
Application number
PCT/CN2017/106168
Other languages
English (en)
French (fr)
Inventor
隋志成
李艳明
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP17928115.9A (EP3686758A4)
Priority to CN201780091549.8A (CN110720104B)
Priority to AU2017435621A (AU2017435621B2)
Priority to US16/754,540 (US11308965B2)
Publication of WO2019071607A1

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F 40/00: Handling natural language data
                    • G06F 40/30: Semantic analysis
        • G10: MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 15/00: Speech recognition
                    • G10L 15/26: Speech to text systems
                    • G10L 15/08: Speech classification or search
                        • G10L 15/18: Speech classification or search using natural language modelling
                            • G10L 15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
                            • G10L 15/1822: Parsing for meaning understanding

Definitions

  • the embodiments of the present invention relate to the field of computer technologies, and in particular, to a voice information processing method, apparatus, and terminal.
  • Intelligent terminals provide more and more functions. For example, a terminal can provide a voice dialogue function for the user: the terminal receives voice information input by the user (such as "open map application"), performs semantic understanding on the voice information, and then executes the event corresponding to the semantic understanding result (opening a map application in the terminal, such as Baidu Maps).
  • In a conventional solution, the terminal may send the received voice information to a cloud server; the cloud server performs semantic understanding on the voice information to obtain a semantic understanding result, and then instructs the terminal to execute the event corresponding to the semantic understanding result.
  • In that solution, the terminal needs to perform at least two data interactions with the cloud server, and a network failure or the like during those interactions may prevent the terminal from executing the event corresponding to the semantic understanding result in time. Moreover, because the amount of voice data is generally large, a large amount of network traffic is consumed.
  • The embodiments of the present application provide a voice information processing method, apparatus, and terminal, which can save the network traffic otherwise consumed by semantic understanding on a cloud server.
  • In a first aspect, an embodiment of the present application provides a voice information processing method. The method includes: receiving, by a terminal, voice information, and converting the voice information into text information, where M event fields are preset in the terminal; acquiring the domain probability that the text information belongs to each of the M event fields, where the domain probability of an event field is used to characterize the likelihood that the text information belongs to that field; acquiring the prior probability that the text information belongs to each of N event fields, where the prior probability of an event field is used to characterize the probability, determined from the multiple semantic understandings already performed, that the text information belongs to that field, and the N event fields are N of the M event fields, with N less than or equal to M; and acquiring the confidence that the text information belongs to each of the N event fields, where the confidence is used to characterize the degree of certainty that the text information belongs to that field.
  • According to the domain probability, the prior probability, and the confidence of each of the N event fields, the terminal calculates N probability values that the text information belongs to the N event fields respectively, and outputs the semantic understanding result obtained by semantically understanding the text information in the event field with the highest of the N probability values.
  • In other words, the semantic understanding result obtained by semantically understanding the text information in the event field with the highest of the N probability values is taken as the final semantic understanding result.
  • To summarize the three quantities: the prior probability that the text information belongs to an event field characterizes, based on historical data, the probability that the text information belongs to that field; the domain probability characterizes the likelihood that the text information belongs to that field; and the confidence characterizes the degree of certainty that the text information belongs to that field.
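  • A minimal sketch of this scoring step in Python follows. The description above does not fix how the three quantities are combined into one probability value, so the sketch simply multiplies them; the field names and numbers are hypothetical.

```python
# Minimal sketch: combine domain probability, prior probability, and
# confidence into one probability value per candidate event field.
# Multiplying the three quantities is an illustrative assumption only.

def score_event_fields(domain_prob, prior_prob, confidence):
    """Each argument maps an event field name to a number in [0, 1]
    for the N candidate event fields."""
    return {
        field: domain_prob[field] * prior_prob[field] * confidence[field]
        for field in domain_prob
    }

# Example with three hypothetical candidate fields.
scores = score_event_fields(
    domain_prob={"music": 0.50, "settings": 0.25, "app": 0.15},
    prior_prob={"music": 0.60, "settings": 0.20, "app": 0.20},
    confidence={"music": 0.90, "settings": 0.40, "app": 0.50},
)
best_field = max(scores, key=scores.get)  # -> "music"
```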
  • In the embodiment of the present application, when selecting the event field, the terminal refers not only to the domain probability obtained by analyzing the vocabulary included in the text information, but also to the prior probability and the confidence that the text information belongs to each event field. This improves the accuracy of the selected event field and hence of the semantic understanding result, so that the executed event better matches the event the user's voice input requested, which can improve the user experience.
  • The terminal may select the event fields whose domain probabilities rank in the top N, in descending order of domain probability.
  • After selecting the N event fields from the M event fields, the terminal only needs to calculate the prior probability and the confidence that the text information belongs to each of the N event fields, rather than the prior probability and confidence for all M event fields, which reduces the amount of computation during voice information processing and improves computational efficiency.
  • With reference to the first aspect, the method in the embodiment of the present application may further include: the terminal separately performs semantic understanding on the text information in each of the N event fields, obtaining N semantic understanding results.
  • Specifically, the terminal can transmit the text information to the dialogue engine of each identified event field, and the dialogue engine performs semantic understanding on the text information to obtain a semantic understanding result.
  • This embodiment does not limit the order in which the terminal performs domain identification and semantic understanding: the two may be performed at the same time or substantially simultaneously, or domain identification may be performed after semantic understanding.
  • In a possible design, each of the M event fields corresponds to a keyword model, and the keyword model includes multiple keywords corresponding to that event field.
  • In this design, acquiring the confidence may include: the terminal performs word segmentation on the text information and extracts at least one word segment; acquires the distribution information of the keywords corresponding to the at least one word segment in the keyword model of each event field; and calculates, based on that distribution information, the confidence that the text information belongs to each of the N event fields.
  • In another possible design, acquiring the domain probability that the text information belongs to each of the M event fields includes: the terminal performs word segmentation on the text information and extracts at least one word segment; searches the database model corresponding to each event field for the features corresponding to the at least one word segment, where each event field corresponds to one database model, and a database model includes multiple features, the weight of each feature, and the word segment corresponding to each feature, the weight indicating the probability that the corresponding feature belongs to the event field of that database model; and calculates, according to the weights of the features found in the database model corresponding to each event field, the domain probability that the text information belongs to that event field.
  • It should be noted that the same word segment has the same feature in the database models of different event fields; that is, in the feature database, the feature of a word segment uniquely identifies that word segment.
  • However, the same word segment may have different weights in different event fields.
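  • A sketch of this database-model lookup follows; since the aggregation rule is not specified above, the sketch assumes a field's domain probability is the normalized sum of the weights of the features found in its model, and the feature tables and word segments are hypothetical.

```python
# Sketch of the database-model variant: the same word segment maps to the
# same feature id in every field's model, but the weights differ per field.
from typing import Dict

FEATURE_WEIGHTS: Dict[str, Dict[int, float]] = {
    "music":    {101: 0.8, 102: 0.7, 103: 0.1},   # e.g. 101 = "play"
    "settings": {101: 0.2, 104: 0.9},             # e.g. 104 = "volume"
}
SEGMENT_TO_FEATURE = {"play": 101, "singer": 102, "song": 103, "volume": 104}

def domain_probabilities(segments):
    raw = {}
    for field, weights in FEATURE_WEIGHTS.items():
        feats = [SEGMENT_TO_FEATURE[s] for s in segments if s in SEGMENT_TO_FEATURE]
        raw[field] = sum(weights.get(f, 0.0) for f in feats)
    total = sum(raw.values()) or 1.0
    return {field: v / total for field, v in raw.items()}  # normalize

print(domain_probabilities(["play", "singer", "song"]))  # music scores highest
```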
  • In another possible design, each of the M event fields corresponds to a keyword model, and the keyword model includes multiple keywords and, for each keyword, the probability that text information containing the keyword belongs to the event field corresponding to the keyword model.
  • The text information may include keywords from the keyword models of several event fields, and each keyword indicates, in the keyword model of a given event field, the probability that the text information belongs to that field. Therefore, based on the probabilities indicated by the keywords of each event field contained in the text information, the domain probability that the text information belongs to each event field can be calculated.
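  • A sketch of this keyword-model variant follows, assuming for illustration that a field's domain probability is the product of the probabilities indicated by that field's keywords found in the text; the keyword models are hypothetical, and the aggregation rule is not specified above.

```python
# Sketch of the keyword-model variant: each keyword carries the probability
# that text containing it belongs to the model's event field.

KEYWORD_MODELS = {
    "music":    {"play": 0.7, "song": 0.8, "next song": 0.9},
    "settings": {"volume": 0.9, "brightness": 0.9, "flight mode": 0.95},
}

def keyword_domain_probability(text: str) -> dict:
    probs = {}
    for field, model in KEYWORD_MODELS.items():
        hits = [p for kw, p in model.items() if kw in text]
        prob = 1.0
        for p in hits:
            prob *= p
        # No keyword from this field found -> probability 0 in this sketch.
        probs[field] = prob if hits else 0.0
    return probs

print(keyword_domain_probability("play song B"))  # music ~0.56, settings 0.0
```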
  • With reference to the first aspect, the method of the embodiment of the present application may further include: after the terminal outputs the above semantic understanding result, performing the operation corresponding to that semantic understanding result.
  • In a second aspect, an embodiment of the present application provides a voice information processing apparatus, which includes: a receiving unit, a conversion unit, a first acquiring unit, a second acquiring unit, a third acquiring unit, a calculating unit, and an output unit.
  • the receiving unit is configured to receive voice information.
  • The conversion unit is configured to convert the voice information received by the receiving unit into text information; M event fields are preset in the terminal.
  • The first acquiring unit is configured to acquire the domain probability that the text information converted by the conversion unit belongs to each of the M event fields, where the domain probability is used to characterize the likelihood that the text information belongs to an event field.
  • The second acquiring unit is configured to acquire the prior probability that the text information converted by the conversion unit belongs to each of the N event fields, where the prior probability is used to characterize the probability, determined from the multiple semantic understandings already performed, that the text information belongs to an event field; the N event fields are N of the M event fields, and N is less than or equal to M.
  • The third acquiring unit is configured to acquire the confidence that the text information converted by the conversion unit belongs to each of the N event fields, where the confidence is used to characterize the degree of certainty that the text information belongs to an event field.
  • The calculating unit is configured to calculate, according to the domain probability of each of the N event fields acquired by the first acquiring unit, the prior probability acquired by the second acquiring unit, and the confidence acquired by the third acquiring unit, the N probability values that the text information belongs to the N event fields respectively.
  • The output unit is configured to output the semantic understanding result obtained by semantically understanding the text information in the event field with the highest probability value among the N probability values calculated by the calculating unit.
  • When N is less than M, the N event fields are those of the preset M event fields whose domain probabilities rank in the top N in descending order, N ≥ 2.
  • the above voice information processing apparatus further includes: a semantic understanding unit.
  • The semantic understanding unit is configured to: after the first acquiring unit acquires the domain probability of each of the M event fields, perform semantic understanding on the text information in each of the N event fields to obtain N semantic understanding results.
  • the above voice information processing apparatus further includes: a storage unit.
  • The storage unit is configured to save the keyword model corresponding to each of the M event fields, where the keyword model includes multiple keywords corresponding to the event field.
  • The third acquiring unit is specifically configured to: perform word segmentation on the text information and extract at least one word segment; acquire the distribution information of the keywords corresponding to the at least one word segment in the keyword model of each event field saved by the storage unit; and calculate, according to that distribution information, the confidence that the text information belongs to each of the N event fields.
  • The first acquiring unit is specifically configured to: perform word segmentation on the text information and extract at least one word segment; search the database model corresponding to each event field for the features corresponding to the at least one word segment, where each event field corresponds to one database model, and a database model includes multiple features, the weight of each feature, and the word segment corresponding to each feature, the weight indicating the probability that the feature belongs to the event field corresponding to that database model; and calculate, according to the weights of the features found in the database model corresponding to each event field, the domain probability that the text information belongs to that event field.
  • the above voice information processing apparatus further includes: a storage unit.
  • The storage unit is configured to save the keyword model corresponding to each of the M event fields, where the keyword model includes multiple keywords and, for each keyword, the probability that text information containing the keyword belongs to the event field corresponding to the keyword model.
  • Correspondingly, the first acquiring unit is configured to: identify at least one keyword in the text information; obtain, from the keyword model corresponding to each event field, the probability indicated by each of the at least one keyword; and calculate, from the probabilities so indicated, the domain probability that the text information belongs to each event field.
  • the above voice information processing apparatus further includes: an execution unit.
  • The execution unit is configured to perform, after the output unit outputs the semantic understanding result, the operation corresponding to that semantic understanding result.
  • In a third aspect, an embodiment of the present application provides a terminal, which includes: one or more processors and one or more memories. The one or more memories store one or more computer programs comprising instructions that, when executed by the one or more processors, cause the terminal to perform the voice information processing method according to the first aspect and any of its possible designs.
  • In a fourth aspect, an embodiment of the present application provides an electronic device that includes an apparatus for performing the voice information processing method according to the first aspect and any of its possible designs.
  • In a fifth aspect, an embodiment of the present application provides a computer program product containing instructions that, when run on an electronic device, cause the electronic device to perform the voice information processing method according to the first aspect and any of its possible designs.
  • In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium containing instructions that, when run on an electronic device, cause the electronic device to perform the voice information processing method according to the first aspect and any of its possible designs.
  • The apparatus of the second aspect, the terminal of the third aspect, the electronic device of the fourth aspect, the computer program product of the fifth aspect, and the computer storage medium of the sixth aspect are all used to perform the corresponding methods provided above; for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods, which are not repeated here.
  • FIG. 1 is a schematic diagram of the hardware structure of a terminal according to an embodiment of the present application;
  • FIG. 2 is a schematic diagram of an architecture for voice information processing according to an embodiment of the present application;
  • FIG. 3 is a first flowchart of a voice information processing method according to an embodiment of the present application;
  • FIG. 4 is a second flowchart of a voice information processing method according to an embodiment of the present application;
  • FIG. 5 is a schematic diagram of history records of semantic understanding results in a voice information processing method according to an embodiment of the present application;
  • FIG. 6 is a third flowchart of a voice information processing method according to an embodiment of the present application;
  • FIG. 7 is a first schematic diagram of an example keyword database according to an embodiment of the present application;
  • FIG. 8 is a first schematic diagram of an execution process of a voice information processing method according to an embodiment of the present application;
  • FIG. 9 is a second schematic diagram of an example keyword database according to an embodiment of the present application;
  • FIG. 10 is a fourth flowchart of a voice information processing method according to an embodiment of the present application;
  • FIG. 11 is a fifth flowchart of a voice information processing method according to an embodiment of the present application;
  • FIG. 12 is a first schematic diagram of an example feature database according to an embodiment of the present application;
  • FIG. 13 is a second schematic diagram of an example feature database according to an embodiment of the present application;
  • FIG. 14 is a second schematic diagram of an execution process of a voice information processing method according to an embodiment of the present application;
  • FIG. 15 is a first schematic structural diagram of a voice information processing apparatus according to an embodiment of the present application;
  • FIG. 16 is a second schematic structural diagram of a voice information processing apparatus according to an embodiment of the present application;
  • FIG. 17 is a schematic structural diagram of a terminal according to an embodiment of the present application.
  • The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the present application, "multiple" means two or more unless otherwise stated.
  • the embodiment of the present invention provides a voice information processing method and a terminal, which can be applied to a process in which a terminal performs a voice conversation with a user.
  • the method is specifically applied to the terminal receiving the voice information input by the user, performing semantic understanding on the voice information, and performing an event corresponding to the semantic understanding result. For example, the user controls the terminal through voice.
  • The semantic understanding of voice information in the embodiment of the present application may include: converting the voice information into text information, and then analyzing the text information to identify the event that the text information instructs the terminal to perform. For example, when the terminal receives the voice input "Remind me to turn on flight mode at 22:00", it converts it into the text "Remind me to turn on flight mode at 22:00" and then recognizes that the event the text instructs the terminal to perform is "remind the user at 22:00 to turn on flight mode", not directly "turn on flight mode".
  • As described above, in conventional solutions, the data interaction between the terminal and the cloud server may, due to network failures and the like, prevent the terminal from executing the event corresponding to the semantic understanding result in time; moreover, because the amount of voice data is generally large, conventional solutions consume a large amount of network traffic.
  • In contrast, in the voice information processing method provided by the embodiment of the present application, the above semantic understanding may be performed by the terminal itself.
  • If, when performing this semantic understanding, the terminal simply analyzed the vocabulary in the converted text information to determine the event field to which the text information belongs, then had that field's dialogue engine semantically understand the text with that field's semantic understanding algorithm, and finally executed the event corresponding to the result, a problem would arise: simple analysis of the vocabulary may identify the wrong event field, and the dialogue engine of a wrong event field applies that field's semantic understanding algorithm to the text, so the resulting semantic understanding result is inaccurate. The event the terminal then executes may differ from the event the user's voice input actually instructed, which affects the user experience.
  • In the embodiment of the present application, after the voice information is converted into text information, the terminal obtains, according to the historical data of its past semantic understandings, the prior probability that the text information belongs to each event field; the prior probability characterizes, based on that historical data, the probability that the text information belongs to the field. The terminal also obtains the domain probability of each event field, which characterizes the likelihood that the text information belongs to that field.
  • In addition, the terminal can calculate the confidence that the text information belongs to each event field; the confidence that the text information belongs to an event field is used to characterize the degree of certainty that the text information belongs to that field.
  • Then, for each event field, the terminal may calculate, according to the prior probability, the domain probability, and the confidence of the text information, a probability value that the text information belongs to that field, thereby obtaining the probability value of the text information for each event field.
  • Finally, the terminal can have the dialogue engine of the event field with the highest probability value semantically understand the text information, take the obtained result as the semantic understanding result of the text information (that is, of the voice information), and execute the event corresponding to that result.
  • In this way, the accuracy of the selected event field is improved, which improves the accuracy of the semantic understanding result, so that the event the terminal executes better matches the event the user's voice input instructed; this can improve the user experience.
  • The terminal in the embodiment of the present application may be any device that allows the user to input voice information to instruct it to perform related operation events, such as a mobile phone (for example, the mobile phone 100 shown in FIG. 1), a tablet computer, a personal computer (PC), or a personal digital assistant (PDA).
  • The event field to which text information belongs in the embodiment of the present application refers to the field of the event that the terminal is instructed to execute by the semantic understanding result obtained by semantically understanding that text information.
  • the event field in the embodiment of the present application may include a music field, a setting field, an application (APP) field, and the like.
  • For example, text information such as "play song A" and "play the next song" belongs to the music field; text information such as "lower the screen brightness" and "turn on flight mode" belongs to the setting field; and text information such as "open WeChat" and "navigate to No. 10 A Street" belongs to the APP field.
  • the mobile phone 100 is used as an example of the terminal.
  • As shown in FIG. 1, the mobile phone 100 may include components such as a processor 101, a radio frequency (RF) circuit 102, a memory 103, a touch screen 104, a Bluetooth device 105, one or more sensors 106, a Wi-Fi device 107, a positioning device 108, an audio circuit 109, a peripheral interface 110, and a power supply device 111. These components can communicate over one or more communication buses or signal lines (not shown in FIG. 1). Those skilled in the art will understand that the hardware structure shown in FIG. 1 does not constitute a limitation on the mobile phone; the mobile phone 100 may include more or fewer components than illustrated, combine some components, or arrange the components differently.
  • The processor 101 is the control center of the mobile phone 100. It connects the various parts of the mobile phone 100 through various interfaces and lines, and performs the various functions of the mobile phone 100 and processes data by running or executing applications stored in the memory 103 and calling data stored in the memory 103.
  • processor 101 can include one or more processing units.
  • the processor 101 may further include a fingerprint verification chip for verifying the collected fingerprint.
  • The radio frequency circuit 102 can be used to receive and transmit wireless signals during the sending and receiving of information or during a call. Specifically, the radio frequency circuit 102 can receive downlink data from the base station and pass it to the processor 101 for processing, and can transmit uplink data to the base station.
  • radio frequency circuits include, but are not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like.
  • the radio frequency circuit 102 can also communicate with other devices through wireless communication.
  • the wireless communication can use any communication standard or protocol, including but not limited to global mobile communication systems, general packet radio services, code division multiple access, wideband code division multiple access, long term evolution, email, short message service, and the like.
  • the memory 103 is used to store applications and data, and the processor 101 executes various functions and data processing of the mobile phone 100 by running applications and data stored in the memory 103.
  • The memory 103 mainly includes a program storage area and a data storage area. The program storage area can store the operating system and the applications required by at least one function (such as a sound playing function or an image playing function); the data storage area can store data created during the use of the mobile phone 100 (such as audio data and a phone book).
  • The memory 103 may include high-speed random access memory (RAM), and may also include non-volatile memory, such as a magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • In addition, the memory 103 can store various operating systems.
  • the above memory 103 may be independent and connected to the processor 101 via the above communication bus; the memory 103 may also be integrated with the processor 101.
  • the touch screen 104 may specifically include a touch panel 104-1 and a display 104-2.
  • The touch panel 104-1 can collect touch events performed by the user of the mobile phone 100 on or near it (for example, operations the user performs with a finger, a stylus, or any other suitable object on or near the touch panel 104-1) and send the collected touch information to another device (for example, the processor 101).
  • A touch event near, but not on, the touch panel 104-1 may be called a hovering touch: the user does not need to directly touch the panel to select, move, or drag a target (for example, an icon), but only needs to be near the device to perform the desired function.
  • The touch panel 104-1 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave.
  • The display 104-2 (also referred to as a display screen) can be used to display information entered by the user, information provided to the user, and the various menus of the mobile phone 100.
  • the display 104-2 can be configured in the form of a liquid crystal display, an organic light emitting diode, or the like.
  • The touch panel 104-1 can be overlaid on the display 104-2. When the touch panel 104-1 detects a touch event on or near it, it passes the event to the processor 101 to determine the event type, and the processor 101 then provides a corresponding visual output on the display 104-2 according to that type.
  • Although in FIG. 1 the touch panel 104-1 and the display 104-2 are shown as two separate components implementing the input and output functions of the mobile phone 100, in some embodiments the touch panel 104-1 and the display screen 104-2 can be integrated to implement those functions. It should be understood that the touch screen 104 is formed by stacking multiple layers of material; only the touch panel (layer) and the display screen (layer) are described in the embodiment of the present application, and the other layers are not described here.
  • In addition, the touch panel 104-1 may be disposed on the front of the mobile phone 100 in a full-panel form, and the display screen 104-2 may likewise be disposed on the front of the mobile phone 100 in a full-panel form, enabling a bezel-less structure on the front of the phone.
  • the mobile phone 100 can also have a fingerprint recognition function.
  • For this purpose, a fingerprint collection device 112 can be configured on the back of the mobile phone 100 (for example, below the rear camera) or on the front of the mobile phone 100 (for example, below the touch screen 104).
  • the fingerprint collection device 112 can be configured in the touch screen 104 to implement the fingerprint recognition function, that is, the fingerprint collection device 112 can be integrated with the touch screen 104 to implement the fingerprint recognition function of the mobile phone 100.
  • the fingerprint capture device 112 is disposed in the touch screen 104 and may be part of the touch screen 104 or may be otherwise disposed in the touch screen 104.
  • The main component of the fingerprint collection device 112 in the embodiment of the present application is a fingerprint sensor, which can adopt any type of sensing technology, including but not limited to optical, capacitive, piezoelectric, or ultrasonic sensing technologies.
  • the mobile phone 100 may also include a Bluetooth device 105 for enabling data exchange between the handset 100 and other short-range devices (eg, mobile phones, smart watches, etc.).
  • the Bluetooth device in the embodiment of the present application may be an integrated circuit or a Bluetooth chip or the like.
  • the handset 100 can also include at least one type of sensor 106, such as a light sensor, motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display of the touch screen 104 according to the brightness of the ambient light, and the proximity sensor may turn off the power of the display when the mobile phone 100 moves to the ear.
  • The accelerometer sensor can detect the magnitude of acceleration in all directions (usually three axes) and, when stationary, the magnitude and direction of gravity. It can be used for applications that recognize the phone's attitude (such as landscape/portrait switching, related games, and magnetometer attitude calibration) and for vibration-recognition functions (such as a pedometer or tap detection).
  • The mobile phone 100 can also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described here.
  • The Wireless Fidelity (Wi-Fi) device 107 provides the mobile phone 100 with network access complying with the Wi-Fi standard protocols; through the Wi-Fi device 107, the mobile phone 100 can connect to a Wi-Fi access point.
  • Wi-Fi device 107 can also function as a Wi-Fi wireless access point, and can provide Wi-Fi network access to other devices.
  • The positioning device 108 is configured to provide a geographic location for the mobile phone 100. It can be a receiver for a positioning system such as the Global Positioning System (GPS), the BeiDou satellite navigation system, or Russia's GLONASS. After receiving a geographic location from the positioning system, the positioning device 108 sends the information to the processor 101 for processing, or to the memory 103 for storage. In some other embodiments, the positioning device 108 can be a receiver for an Assisted Global Positioning System (AGPS), in which an assistance server helps the positioning device 108 perform ranging and positioning services.
  • In that case, the assistance server communicates over a wireless communication network with the positioning device 108 (that is, the GPS receiver) of a device such as the mobile phone 100 to provide positioning assistance.
  • The positioning device 108 can also use Wi-Fi access-point-based positioning technology. Because every Wi-Fi access point has a globally unique Media Access Control (MAC) address, a device can scan for and collect the broadcast signals of surrounding access points while Wi-Fi is turned on and thereby obtain their MAC addresses. The device sends data identifying the access points (such as the MAC addresses) to a location server over the wireless communication network; the location server retrieves the geographic location of each access point, calculates the device's geographic location from those locations together with the strength of the Wi-Fi broadcast signals, and sends the result to the positioning device 108 of the device.
  • the audio circuit 109, the speaker 113, and the microphone 114 can provide an audio interface between the user and the handset 100.
  • On one hand, the audio circuit 109 can convert received audio data into an electrical signal and transmit it to the speaker 113, which outputs it as a sound signal; on the other hand, the microphone 114 converts collected sound into an electrical signal, which the audio circuit 109 receives and converts into audio data. The audio data is then output to the RF circuit 102 for transmission to, for example, another mobile phone, or to the memory 103 for further processing.
  • The peripheral interface 110 provides various interfaces for external input/output devices (such as a keyboard, a mouse, an external display, external memory, or a subscriber identity module card). For example, it connects to a mouse through a Universal Serial Bus (USB) interface, and to a Subscriber Identity Module (SIM) card provided by the operator through metal contacts in the SIM card slot. The peripheral interface 110 can be used to couple these external input/output peripherals to the processor 101 and the memory 103.
  • In the embodiment of the present application, the mobile phone 100 can communicate with other devices in a device group through the peripheral interface 110, for example, receive display data sent by another device and display it; no restriction is imposed on this.
  • The mobile phone 100 may further include a power supply device 111 (such as a battery and a power management chip) that supplies power to the various components; the battery may be logically connected to the processor 101 through the power management chip, so that functions such as charging, discharging, and power consumption management are handled through the power supply device 111.
  • the mobile phone 100 may further include a camera (front camera and/or rear camera), a flash, a micro projection device, a near field communication (NFC) device, and the like, and details are not described herein.
  • FIG. 2 is a schematic structural diagram of a method for performing voice information processing according to an embodiment of the present application, where the architecture is located in a terminal.
  • the architecture includes a central control layer 201, a dialog engine layer 202, and an algorithm layer 203.
  • The central control layer 201 includes: a voice service interface (VSI) 2011, a domain identification module 2012, a scheduling distribution module 2013, and a decision summary (DS) module 2014.
  • the dialog engine layer 202 includes a dialog engine 1, a dialog engine 2, and a dialog engine 3.
  • the algorithm layer 203 includes: a "model and algorithm library” 2031, a rule library 2032, a points of interest (POI) library 2033, and a state model 2034.
  • the central control layer 201 is configured to receive voice information through the VSI 2011 (eg, receive voice information from a third party application), and then transmit the received voice information to the domain identification module 2012.
  • The domain identification module 2012 is configured to convert the received voice information into text information, perform preliminary domain identification on the text information to identify at least two candidate event fields, and then transmit the identification result to the scheduling distribution module 2013. To perform this domain identification, the domain identification module 2012 can invoke the "model and algorithm library" 2031, the Rule library 2032, the POI library 2033, and the state model 2034 in the algorithm layer 203.
  • The "model and algorithm library" 2031 may include multiple algorithms (which may also be referred to as models) that support the domain identification module 2012 and the dialogue engines (such as dialogue engine 1) in the dialog engine layer 202 in analyzing text information.
  • For example, the "model and algorithm library" 2031 in the algorithm layer 203 includes algorithms such as: the logistic regression/support vector machine (LR/SVM) algorithm, the term frequency-inverse document frequency (TF-IDF) algorithm, the N-Gram/word segmentation (WS) algorithm, semantic role labeling (SRL), part-of-speech (POS) tagging, named entity recognition (NER), the conditional random field (CRF) algorithm, statistical machine translation (SMT), deep neural network (DNN) algorithms, the convolutional/recurrent neural network (C/RNN) algorithm, and the long short-term memory (LSTM) algorithm.
  • N-Gram is a commonly used language model in large vocabulary continuous speech recognition.
  • For Chinese, it may be called the Chinese Language Model (CLM). The Chinese language model uses the collocation information between adjacent words in the context of the voice information and can automatically convert voice information into Chinese characters (that is, text information).
  • The Rule library 2032 in the algorithm layer 203 may include the semantic understanding rules for text information attributed to each event field.
  • For example, the Rule library 2032 may include a semantic understanding rule for text information attributed to the APP field, a semantic understanding rule for text information attributed to the setting field, and a semantic understanding rule for text information attributed to the music field.
  • the semantic understanding rule of an event field in the Rule library 2032 can be used to indicate an algorithm to be called from the "model and algorithm library" 2031 when semantically understanding the text information attributed to the event field.
  • For example, the semantic understanding rule of the APP field in the Rule library 2032 can indicate that, for semantic understanding of text information attributed to the APP field, the LR/SVM algorithm and the TF-IDF algorithm are called from the "model and algorithm library".
  • The POI library 2033 may be a data collection of the information referenced by the rules in the Rule library 2032, such as object names (restaurant names, school names, etc.), object addresses (restaurant addresses, school addresses, etc.), latitude and longitude, and categories (schools, restaurants, government agencies, shopping malls, etc.).
  • For example, corresponding to the semantic understanding rule of text information attributed to the music field in the Rule library 2032, the POI library 2033 may include singer names, song titles, and the like.
  • the POI library 2033 can maintain multiple data sets according to different addresses; or, multiple data sets can be maintained according to different categories.
  • The state model 2034 is the model by which the dialogue engines in the dialog engine layer 202 manage dialogue state; it may be a deterministic model, a probabilistic model, a Markov model, or another custom model.
  • the state model 2034 can provide a transition between different dialog states during the terminal's dialogue with the user.
  • For example, a probabilistic model may specify that, after the user inputs voice information, if the probability that the corresponding text information belongs to the navigation field is greater than a preset value, the text information is passed to semantic understanding in the navigation field.
  • The scheduling distribution module 2013 is configured to distribute the text information to the dialogue engines (such as dialogue engine 1) corresponding to the at least two event fields indicated by the recognition result; each of these dialogue engines then performs natural-language semantic understanding on the text information through its natural language understanding (NLU), dialogue management (DM), and natural language processing (NLP) modules.
  • Each dialogue engine in the dialog engine layer 202 corresponds to one event field.
  • For example, dialogue engine 1 corresponds to the setting field, dialogue engine 2 corresponds to the APP field, and dialogue engine 3 corresponds to the music field.
  • The dialogue engine of each event field may include an NLU module, a DM module, and an NLP module for semantically understanding text information to obtain a semantic understanding result.
  • Each dialogue engine may call the models, algorithms, and rules corresponding to it in the "model and algorithm library" 2031, the Rule library 2032, and the POI library 2033 of the algorithm layer 203 to perform semantic understanding on the text information.
  • Each dialogue engine may transmit its semantic understanding result to the DS module 2014. The DS module 2014 performs the method steps of the embodiment of the present application and selects, from the semantic understanding results fed back by the multiple dialogue engines, the result corresponding to the event field with the highest attribution probability value (that is, the optimal semantic understanding result shown in FIG. 3). The selected result is used as the semantic understanding result of the text information and is fed back through the VSI 2011.
  • The functions of the central control layer 201, the dialog engine layer 202, and part of the algorithm layer 203 shown in FIG. 2 can be integrated into the processor 101 of the mobile phone 100 shown in FIG. 1, and information such as the algorithms and rules of the algorithm layer 203 can be stored in the memory 103 of the mobile phone 100 shown in FIG. 1. That is, the architecture for voice information processing shown in FIG. 2 can be located in the mobile phone 100 shown in FIG. 1.
  • The voice information processing method includes S401-S406:
  • S401. The terminal receives voice information and converts the voice information into text information.
  • the terminal has M event fields preset.
  • For example, the terminal may call an algorithm for speech-to-text conversion (such as the N-Gram/WS algorithm) in the algorithm layer 203 shown in FIG. 2 to convert the voice information into text information.
  • the terminal may convert the voice information into text information by calling a voice-to-text (speech-to-text) program.
  • the specific manner of converting the voice information into text information is not limited.
  • The voice information received by the terminal generally refers to speech uttered by the user; that is, the user speaks, and the terminal receives the speech and performs the subsequent actions.
  • M event fields, such as the music field, the setting field, and the application (APP) field, can be preset in the terminal.
  • S402. The terminal acquires the domain probability that the text information belongs to each of the M event fields, where the domain probability is used to characterize the likelihood that the text information belongs to an event field.
  • Specifically, through the domain identification module 2012 in the central control layer 201 shown in FIG. 2, the terminal can invoke the algorithms for semantic parsing of text information in the algorithm layer 203 to obtain the domain probability that the text information belongs to each event field. The higher the domain probability that the text information belongs to an event field, the more likely it is that the text information belongs to that event field.
  • After S402, the terminal may select, from the M event fields, the N event fields corresponding to the text information.
  • the method of the present application may further include S402':
  • S402'. The terminal selects, from the M event fields, the N event fields corresponding to the text information.
  • The N event fields in the embodiment of the present application are part of the M event fields, N < M.
  • Specifically, the N event fields are those of the preset M event fields whose domain probabilities rank in the top N in descending order, N ≥ 2. That is, after executing S402, the terminal may select from the M event fields the N event fields with the highest domain probabilities.
  • For example, suppose the domain probability that the text information belongs to event field 1 is 50%, to event field 2 is 25%, to event field 3 is 10%, and to event field 4 is 15%. Since 50% > 25% > 15% > 10%, the terminal can select, from event fields 1-4, the three event fields with the highest domain probabilities, namely event field 1, event field 2, and event field 4.
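  • A minimal sketch of this top-N selection in S402', using the example numbers above with N = 3; the field names are placeholders.

```python
# Select the N event fields with the highest domain probability.
domain_probs = {"field1": 0.50, "field2": 0.25, "field3": 0.10, "field4": 0.15}

def top_n_fields(probs: dict, n: int) -> list:
    """Return the n event fields with the highest domain probability."""
    return sorted(probs, key=probs.get, reverse=True)[:n]

print(top_n_fields(domain_probs, 3))  # ['field1', 'field2', 'field4']
```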
  • In this way, when performing S403-S405, the terminal only needs to calculate the prior probability and the confidence that the text information belongs to the N event fields, and does not need to calculate the prior probability and confidence for all of the M event fields, which reduces the amount of computation during voice information processing and improves computational efficiency.
  • S402' is optional, and the terminal may not perform S402'.
  • the method of the present application may include S403-S406:
  • S403. The terminal acquires the prior probability that the text information belongs to each of the N event fields, where the prior probability is used to characterize the probability, determined from the multiple semantic understandings already performed, that the text information belongs to an event field; the N event fields are N of the M event fields, and N is less than or equal to M.
  • the prior probability of an event field is used to characterize the probability that the text information belongs to the event field according to the multiple semantic understandings that have been made.
  • the terminal may separately acquire the prior probability of each of the N event fields according to historical data that has been semantically understood multiple times.
  • Here, the present application illustrates how the terminal obtains the prior probability that the text information belongs to each of the N event fields by taking as an example the prior probability that the text information belongs to a first event field; the first event field may be any of the N event fields.
  • Specifically, the terminal may count the total number X of semantic understandings it has performed; count, among those X semantic understandings, the number y whose semantic understanding results indicated that the terminal executed an event in the first event field; and calculate the ratio y/X of y to X, which is the prior probability of the first event field.
  • The total number X counted by the terminal refers to the total number of semantic understandings that the terminal has performed. "All semantic understandings" here does not restrict the object of semantic understanding; it includes the terminal's semantic understanding of any text information.
  • For example, suppose the terminal has performed P semantic understandings in total, of which a results indicated that the executed event belongs to the music field (referred to as semantic understandings of the music field), b results indicated the setting field, and c results indicated the APP field. The terminal can then determine that the prior probability that the text information to be processed (say, text information K) belongs to the music field is a/P, the prior probability that text information K belongs to the setting field is b/P, and the prior probability that text information K belongs to the APP field is c/P.
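  • A sketch of this prior-probability computation: the prior of each field is the fraction of past semantic understandings whose executed event fell in that field (the a/P, b/P, c/P above); the history entries are hypothetical.

```python
# Prior probability of each event field from the history of past results.
from collections import Counter

history = ["music", "settings", "music", "app", "music", "settings"]

def prior_probabilities(history: list) -> dict:
    counts = Counter(history)
    total = len(history)  # P in the text above
    return {field: n / total for field, n in counts.items()}

print(prior_probabilities(history))
# {'music': 0.5, 'settings': 0.333..., 'app': 0.166...}
```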
  • Generally, the objects (that is, the text information) of two adjacent semantic understandings are related: the event field to which the object of the previous semantic understanding belongs may affect the event field to which the object of the next semantic understanding belongs.
  • For example, the user's previous voice input may be "navigation", after which the mobile phone receives voice input such as "go to No. 100 X Street" or other location information. Both "navigation" and "go to No. 100 X Street" can be attributed to the APP field and instruct the mobile phone to invoke the map APP to execute the corresponding event.
  • Therefore, in one approach, when the terminal acquires the prior probability that text information K belongs to each of the N event fields, if the object of the previous semantic understanding belongs to event field A, the terminal can determine that the prior probability that text information K belongs to event field A is a, with a > 0.5, and that the prior probability that text information K belongs to each of the other event fields is (1-a)/(N-1).
  • In another approach, when calculating the prior probability that text information K belongs to the first event field (say, event field P), the terminal may look up the event field of the text of its previous semantic understanding (denoted event field Q), and then, over all the semantic understandings it has performed, compute the probability that the texts of two adjacent semantic understandings are attributed to event field Q and then event field P in chronological order (Q first, P second). That probability is taken as the prior probability that text information K belongs to the first event field (event field P).
  • For example, suppose the terminal has performed Y semantic understandings in total, and their results indicate that the executed events belong, in order, to: the setting field, the APP field, the setting field, the music field, the music field, the setting field, the APP field, the setting field, ..., the setting field.
  • The terminal counts the number P of semantic understandings of the setting field among the Y semantic understandings.
  • the terminal acquires the semantic understanding of the P setting fields, and sets the event domain corresponding to the next semantic understanding of the domain each time; the previous semantic understanding is to set the domain, and the adjacent semantic understanding is to set the domain.
  • a the previous semantic understanding is to set the field, the next time the semantic understanding is the number of times the music field is b
  • the previous semantic understanding is to set the field
  • the terminal can determine that the prior probability of the text information K belongs to the set field is a/P, the prior probability of the text information K belongs to the music field is b/P, and the prior probability of the text information K belongs to the APP domain is c /P.
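The counting described above can likewise be sketched in a few lines. The fragment below is an illustrative reconstruction, not code from the patent: the history list, the domain names, and the fallback to a uniform prior when the previous domain has never been observed are all assumptions.

```python
from collections import Counter

# Hypothetical history of the event domains of past semantic understandings,
# in chronological order (cf. the records illustrated by FIG. 5).
history = ["setting", "app", "setting", "music", "music",
           "setting", "app", "setting", "setting"]

def bigram_priors(history, domains, previous_domain):
    """Prior of the pending text conditioned on the domain of the previous
    semantic understanding: over all adjacent pairs in the history, count
    how often previous_domain was immediately followed by each domain."""
    followers = Counter(nxt for prev, nxt in zip(history, history[1:])
                        if prev == previous_domain)
    total = sum(followers.values())
    if total == 0:                       # no evidence: assume uniform priors
        return {d: 1 / len(domains) for d in domains}
    return {d: followers[d] / total for d in domains}

print(bigram_priors(history, ["music", "setting", "app"], "setting"))
# {'music': 0.25, 'setting': 0.25, 'app': 0.5}
```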
It should be noted that the "previous" and "next" appearing in this embodiment refer to chronological order: the previous one occurs first, and the next one occurs after it.
The methods for the terminal to acquire the prior probability that the text information belongs to each of the N event domains include, but are not limited to, the foregoing methods; other methods by which the terminal can obtain the prior probability that the text information belongs to each event domain are not described here again.
S404: The terminal acquires the confidence that the text information belongs to each of the N event domains, where the confidence is used to represent the degree of certainty that the text information belongs to an event domain.
Specifically, the terminal may save a keyword model for each of the M event domains; the keyword model of each event domain includes multiple keywords of that event domain, which are words and phrases commonly used in that event domain. The terminal may perform word segmentation processing on the text information and extract at least one word segment, and then calculate, according to the distribution of the keywords corresponding to the at least one word segment in the keyword models of the multiple event domains, the confidence that the text information belongs to each of the preset multiple event domains.
Specifically, S404 shown in FIG. 4 may include S404a-S404c. S404a: The terminal performs word segmentation processing on the text information and extracts at least one word segment.
Specifically, the terminal may, through the domain identification module 2012 in the central control layer 201 shown in FIG. 2, call the algorithm layer 203 to perform word segmentation processing on the text information and extract at least one word segment.
For example, for text information 1, the terminal may perform word segmentation processing and extract the following word segments: "play", "singer A", and "song B".
For another example, suppose the terminal receives the text information "Help me turn down the volume" (text information 2) while playing a song of singer A. The terminal can perform word segmentation processing on this text information and extract the following word segments: "help", "me", "turn down", and "volume".
S404b: The terminal acquires the distribution information of the keywords corresponding to the at least one word segment in the keyword model of each event domain.
For example, a keyword database 701 shown in FIG. 7 may be maintained, and the keyword database 701 may include the keyword models of multiple event domains. Suppose two event domains, the music domain and the setting domain, are preset in the terminal. As shown in FIG. 7, the keyword database 701 includes a keyword model 702 of the music domain and a keyword model 703 of the setting domain. The keyword model 702 of the music domain includes multiple keywords of the music domain, such as "play", "next song", "singer", "rock", and "song". The keyword model 703 of the setting domain includes multiple keywords of the setting domain, such as "flight mode", "Bluetooth", "brightness", "volume", and "down".
S404c: The terminal calculates, according to the distribution information, the confidence that the text information belongs to each of the N event domains. Here, the confidence that the text information belongs to the first event domain is used to indicate the degree of certainty that the text information belongs to the first event domain.
Taking text information 1 as an example, the terminal may determine that the keyword "play" corresponding to the segment "play", the keyword "singer" corresponding to the segment "singer A", and the keyword "song" corresponding to the segment "song B" are all included in the keyword model 702 of the music domain. That is, the terminal can determine that the keywords corresponding to all the word segments of text information 1 are distributed in the keyword model 702 of the music domain. In this case, the terminal can determine that the confidence that text information 1 belongs to the music domain is 90%, and the confidence that text information 1 belongs to the setting domain is 10%.
Similarly, after performing word segmentation on text information 2 to obtain at least one word segment, the terminal may determine, according to the distribution of the corresponding keywords, that the confidence that text information 2 belongs to the setting domain is 80%, the confidence that it belongs to the music domain is 10%, and the confidence that it belongs to the APP domain is 10%.
For another example, suppose the terminal performs word segmentation on text information 3 to obtain 8 word segments, and the keywords corresponding to 5 of the 8 segments are distributed in the keyword model of the setting domain, the keywords corresponding to 2 segments are distributed in the keyword model of the music domain, and the keyword corresponding to 1 segment is distributed in the keyword model of the APP domain. In this case, the terminal may determine that the confidence that text information 3 belongs to the setting domain is 62.5% (5/8), the confidence that it belongs to the music domain is 25% (2/8), and the confidence that it belongs to the APP domain is 12.5% (1/8).
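A compact sketch of S404a-S404c is given below. It is illustrative only: the keyword sets are invented, a segment is taken to "hit" a domain when some keyword of that domain occurs in the segment, and the confidence is computed as hits divided by the number of segments (the rule that reproduces the 62.5%/25%/12.5% split of text information 3); the 90%/10% figures above suggest that the actual implementation additionally smooths these values.

```python
# Hypothetical keyword models; cf. keyword database 701 in FIG. 7.
keyword_models = {
    "music":   {"play", "next song", "singer", "rock", "song"},
    "setting": {"flight mode", "bluetooth", "brightness", "volume", "down"},
}

def confidences(segments, models=keyword_models):
    """Confidence per domain = fraction of segments whose keyword falls in
    that domain's keyword model (substring match as a crude stand-in for
    the segment-to-keyword mapping)."""
    total = len(segments)
    return {
        domain: sum(any(kw in seg for kw in kws) for seg in segments) / total
        for domain, kws in models.items()
    }

print(confidences(["play", "singer A", "song B"]))
# {'music': 1.0, 'setting': 0.0} -- cf. the smoothed 90%/10% above
```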
It should be noted that the embodiments of this application do not limit the order in which the terminal performs S402, S403, and S404. For example, the terminal may perform S403 first, then S404, and finally S402; or the terminal may perform S404 first, then S402, and finally S403; or the terminal may perform S402, S403, and S404 substantially simultaneously.
Optionally, the method of this application may further include S402'. In this case, the terminal may first perform S402, then S402', and finally S403 and S404. The embodiments of this application do not limit the order in which the terminal performs S403 and S404: the terminal may perform S403 first and then S404; or the terminal may perform S404 first and then S403; or the terminal may perform S403 and S404 substantially simultaneously.
S405: The terminal calculates, according to the domain probability, prior probability, and confidence that the text information belongs to each of the N event domains, N probability values at which the text information belongs to the N event domains respectively.
Specifically, for the first event domain, the terminal may calculate the product of the prior probability, the domain probability, and the confidence that the text information belongs to the first event domain, and determine the calculated product as the probability value that the text information belongs to the first event domain.
For example, suppose the prior probability that text information a belongs to the music domain is 40%, the prior probability that it belongs to the setting domain is 30%, and the prior probability that it belongs to the APP domain is 30%; the domain probability that text information a belongs to the music domain is 40%, the domain probability that it belongs to the setting domain is 20%, and the domain probability that it belongs to the APP domain is 40%; and the confidence that text information a belongs to the music domain is 10%, the confidence that it belongs to the setting domain is 10%, and the confidence that it belongs to the APP domain is 80%. The terminal can then calculate that the probability value of text information a belonging to the music domain is 40% × 40% × 10% = 1.6%, the probability value of text information a belonging to the setting domain is 30% × 20% × 10% = 0.6%, and the probability value of text information a belonging to the APP domain is 30% × 40% × 80% = 9.6%.
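The combination rule of S405, together with the selection in S406 below, can be sketched as follows; the dictionary layout is an illustrative assumption, and the numbers are the ones from the example above.

```python
# Probability value per domain = prior * domain probability * confidence;
# the event domain with the highest product is selected (cf. S406 below).
prior      = {"music": 0.40, "setting": 0.30, "app": 0.30}
domain_p   = {"music": 0.40, "setting": 0.20, "app": 0.40}
confidence = {"music": 0.10, "setting": 0.10, "app": 0.80}

scores = {d: prior[d] * domain_p[d] * confidence[d] for d in prior}
best = max(scores, key=scores.get)
print(scores)  # {'music': 0.016, 'setting': 0.006, 'app': 0.096}
print(best)    # 'app': its dialogue engine's result is the one output
```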
S406: The terminal outputs the semantic understanding result obtained by performing semantic understanding on the text information according to the event domain with the highest probability value among the N probability values. Optionally, S406 may be replaced by: the terminal takes the semantic understanding result obtained by performing semantic understanding on the text information according to the event domain with the highest probability value among the N probability values as the final semantic understanding result.
Through the foregoing steps, the probability value of the text information belonging to each event domain can be obtained, i.e., multiple probability values are acquired. The terminal then acquires the event domain corresponding to the highest probability value and identifies the event domain with the highest probability value as the event domain corresponding to the text information. After the domain identification of the text information, the terminal can transmit the text information to the dialogue engine of the identified event domain, and the dialogue engine performs semantic understanding on the text information to obtain the semantic understanding result.
Alternatively, this embodiment may not limit the order in which the terminal performs domain identification and semantic understanding; the terminal may perform domain identification and semantic understanding at the same time or substantially at the same time, or perform domain identification after semantic understanding. For example, the dialogue engine of the music domain, the dialogue engine of the setting domain, and the dialogue engine of the APP domain can each perform semantic understanding on text information a to obtain a semantic understanding result.
Since the probability value (9.6%) of text information a belonging to the APP domain is greater than the probability value (1.6%) of text information a belonging to the music domain, and is also greater than the probability value (0.6%) of text information a belonging to the setting domain, the terminal can output the semantic understanding result obtained by the dialogue engine of the APP domain performing semantic understanding on text information a.
In addition, before S406, the terminal may perform semantic understanding on the text information in the N event domains to obtain N semantic understanding results. That is, the method in the embodiments of this application may further include S406': the terminal performs semantic understanding on the text information in each of the N event domains and obtains N semantic understanding results. For the method in which the terminal performs semantic understanding on the text information in each of the N event domains and obtains the N semantic understanding results, reference may be made to the related descriptions in the foregoing embodiments of this application, and details are not described here again.
In general, after obtaining the semantic understanding result, the terminal may further perform the operation corresponding to the semantic understanding result. Specifically, the method in the embodiments of this application may further include S407: after outputting the semantic understanding result, the terminal performs, according to the semantic understanding result, the operation corresponding to the semantic understanding result.
In other words, the terminal takes the semantic understanding result obtained by performing semantic understanding on the text information according to the event domain with the highest probability value as the finally recognized semantic understanding result. On the one hand, the terminal may output the final result internally so that the terminal performs the operation corresponding to the final result. The internal output may be the process in which the terminal determines the final result with the highest probability value, or the terminal may send the final result to other internal components (hardware or software) so that the operation corresponding to the final result is performed by the terminal. On the other hand, the terminal may also output the final semantic understanding result to the outside of the terminal; for example, the terminal may send the final result to other terminals so that the other terminals learn the final result, or so that the other terminals perform the operation corresponding to the final result. Of course, the terminal may both perform the operation corresponding to the final result and output the final result to the outside.
The embodiments of this application provide a voice information processing method. After converting voice information into text information, the terminal can acquire, according to the historical data of the semantic understandings it has performed, the prior probability that the text information belongs to each event domain; analyze the text information to acquire the domain probability that the text information belongs to each event domain; and calculate the confidence that the text information belongs to each event domain. Then, according to the prior probability, domain probability, and confidence that the text information belongs to an event domain, the terminal can calculate the probability value that the text information belongs to that event domain. Finally, the terminal can have the dialogue engine of the event domain with the highest probability value perform semantic understanding on the text information, and take the obtained semantic understanding result as the semantic understanding result of the text information (i.e., of the above voice information). Here, the prior probability that the text information belongs to an event domain is used to represent the probability, in the historical data, that the text information belongs to that event domain; the domain probability that the text information belongs to an event domain is used to represent the likelihood that the text information belongs to that event domain; and the confidence that the text information belongs to an event domain is used to represent the degree of certainty that the text information belongs to that event domain.
When selecting the event domain for processing the text information, the embodiments of this application refer not only to the domain probability obtained by analyzing the vocabulary included in the text information, but also to the prior probability that the text information belongs to the event domain and to the confidence representing the degree of certainty that the text information belongs to the event domain. Therefore, the accuracy of the selected event domain can be improved, and thus the accuracy of the semantic understanding result, so that the conformity between the events executed by the terminal and the events that the user's voice information instructs the terminal to execute can be improved, which can improve the user experience.
Further, the keyword model may include not only multiple keywords, but also, for each keyword, the probability that the keyword indicates that the text information belongs to the corresponding event domain. For example, as shown in FIG. 9, the keyword database 901 includes a keyword model 902 of the music domain and a keyword model 903 of the setting domain.
The keyword model 902 of the music domain may further include: the probability "probability a" with which the keyword "next song" indicates that the text information belongs to the music domain, the probability "probability b" with which the keyword "play" indicates that the text information belongs to the music domain, the probability "probability c" with which the keyword "singer" indicates that the text information belongs to the music domain, the probability "probability d" with which the keyword "perform" indicates that the text information belongs to the music domain, the probability "probability e" with which the keyword "song" indicates that the text information belongs to the music domain, and so on. The keyword model 903 of the setting domain may further include: the probability "probability 1" with which the keyword "flight mode" indicates that the text information belongs to the setting domain, the probability "probability 2" with which the keyword "Bluetooth" indicates that the text information belongs to the setting domain, the probability "probability 3" with which the keyword "volume" indicates that the text information belongs to the setting domain, the probability "probability 4" with which the keyword "down" indicates that the text information belongs to the setting domain, and the like.
The keyword database in the embodiments of this application may be stored in the terminal. Alternatively, the keyword database may be saved in a cloud server, and the terminal can search the keyword database saved by the cloud server for the corresponding keywords and the probabilities indicated by the keywords. Specifically, the terminal may identify at least one keyword from the text information, and then calculate, according to the probabilities indicated by the at least one keyword, the domain probability that the text information belongs to each event domain.
Specifically, the foregoing S402 may include S1001-S1003; that is, S402 in FIG. 4 may be replaced with S1001-S1003.
S1001: The terminal identifies at least one keyword from the text information. Specifically, for each event domain, the terminal may identify whether the keywords in the keyword model of that event domain are included in the text information. For example, suppose two event domains, the music domain and the setting domain, are preset in the terminal, and text information 4 is "turn down the volume of the song when playing the next song". The terminal can recognize that text information 4 includes the keywords "play", "next song", "down", "song", and "volume". Among them, "play", "next song", and "song" are keywords in the keyword model of the music domain, and "down" and "volume" are keywords in the keyword model of the setting domain.
S1002: The terminal acquires, from the keyword model corresponding to each event domain, the probabilities respectively indicated by the at least one keyword.
For example, in the keyword model of the music domain, the keyword "play" indicates that the probability that the text information belongs to the music domain is probability b, the keyword "next song" indicates that the probability that the text information belongs to the music domain is probability a, and the keyword "song" indicates that the probability that the text information belongs to the music domain is probability e. In the keyword model of the setting domain, the keyword "down" indicates that the probability that the text information belongs to the setting domain is probability 4, and the keyword "volume" indicates that the probability that the text information belongs to the setting domain is probability 3.
S1003: The terminal calculates, according to the probabilities respectively indicated by the at least one keyword, the domain probability that the text information belongs to each event domain.
For example, the domain probability that text information 4 belongs to the music domain may be the sum of probability b, probability a, and probability e, and the domain probability that it belongs to the setting domain may be the sum of probability 4 and probability 3. Optionally, the terminal may instead normalize the probabilities indicated by the at least one keyword to calculate the domain probability that the text information belongs to each event domain. For example, the domain probability that text information 4 belongs to the music domain may be (probability b + probability a + probability e)/3, and the domain probability that it belongs to the setting domain may be (probability 4 + probability 3)/2.
In this implementation, the terminal may identify at least one keyword from the text information and then calculate, according to the probabilities indicated by the at least one keyword, the domain probability that the text information belongs to each event domain. Since the at least one keyword may include keywords from the keyword models of the respective event domains, and each keyword, in the keyword models of different event domains, may indicate the probability that the text information belongs to the corresponding event domain, the domain probability that the text information belongs to each event domain can be calculated according to the probabilities indicated by the keywords of the respective event domains included in the text information.
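A sketch of S1001-S1003 is given below. It is illustrative only: the numeric per-keyword probabilities stand in for the symbolic values (probability a, b, e, 3, 4) above, keyword spotting is reduced to substring matching, and both the plain sum and the normalized variant are shown.

```python
# Hypothetical keyword models with per-keyword probabilities (cf. FIG. 9).
keyword_prob = {
    "music":   {"next song": 0.9, "play": 0.8, "song": 0.7},
    "setting": {"volume": 0.85, "down": 0.75},
}

def domain_probabilities(text, normalize=True):
    """Domain probability = sum (or normalized mean) of the probabilities
    indicated by the keywords of that domain found in the text."""
    result = {}
    for domain, kws in keyword_prob.items():
        found = [p for kw, p in kws.items() if kw in text]
        result[domain] = (sum(found) / len(found) if normalize and found
                          else sum(found))
    return result

print(domain_probabilities("turn down the volume of the song "
                           "when playing the next song"))
# with normalization: approximately {'music': 0.8, 'setting': 0.8}
```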
In another implementation, the terminal may maintain a feature database, where the feature database includes the database models of the multiple event domains. Each database model includes multiple features, the weight of each feature, and the word segment corresponding to each feature; the weight of a feature is used to indicate the probability that the corresponding feature belongs to the corresponding event domain. The terminal may perform the following operations for any event domain to calculate the domain probability that the text information belongs to that event domain: perform word segmentation processing on the text information and extract at least one word segment; search the database model of the event domain for the features corresponding to the at least one word segment; and calculate, according to the weights of the found features, the domain probability that the text information belongs to the event domain.
Specifically, the foregoing S402 may include S1101-S1103; that is, S402 in FIG. 4 may be replaced with S1101-S1103.
S1101: The terminal performs word segmentation processing on the text information and extracts at least one word segment. For the method in which the terminal performs word segmentation processing on the text information and extracts at least one word segment, reference may be made to the detailed description of S404a in the foregoing embodiment, and details are not described here again.
S1102: The terminal searches the database model corresponding to each event domain for the features corresponding to the at least one word segment, where the database model includes multiple features, the weight of each feature, and the word segment corresponding to each feature; the weight is used to indicate the probability that the feature corresponding to the weight belongs to the event domain corresponding to the database model; and each event domain corresponds to one database model.
Specifically, the terminal may collect statistics on the multiple word segments that appear in each event domain, and then assign to each word segment a feature that can uniquely identify it. For example, the terminal may assign to a word segment a number that uniquely identifies it; the number may be a decimal number, a binary number, or a number in another format, and the format is not limited here. Then, according to the historical semantic understandings, the terminal can determine the probability that each of the above word segments belongs to each event domain, i.e., the probability that the feature corresponding to each word segment belongs to each event domain.
For example, the terminal may maintain a feature database 1201 that includes a database model 1202 of event domain 1, a database model 1203 of event domain 2, and the like. As shown in FIG. 12, the database model 1202 of event domain 1 includes: feature 102, the word segment a corresponding to feature 102, and the 30% weight of feature 102 in event domain 1; feature 23, the word segment b corresponding to feature 23, and the 15% weight of feature 23 in event domain 1; feature 456, the word segment c corresponding to feature 456, and the 26% weight of feature 456 in event domain 1; and feature 78, the word segment d corresponding to feature 78, and the 81% weight of feature 78 in event domain 1. The database model 1203 of event domain 2 includes: feature 375, the word segment e corresponding to feature 375, and the 62% weight of feature 375 in event domain 2; feature 102, the word segment a corresponding to feature 102, and the 40% weight of feature 102 in event domain 2; feature 168, the word segment f corresponding to feature 168, and the 2% weight of feature 168 in event domain 2; and feature 456, the word segment c corresponding to feature 456, and the 53% weight of feature 456 in event domain 2.
In the feature database, the same word segment has the same feature in the database models of different event domains; that is, the feature of a word segment can uniquely identify it in the feature database. However, the same word segment has different weights in different event domains.
For example, the feature of word segment a is 102 and the feature of word segment c is 456; feature 102 has a weight of 30% in event domain 1 and a weight of 40% in event domain 2, while feature 456 has a weight of 26% in event domain 1 and a weight of 53% in event domain 2.
For example, suppose the terminal receives the text information "Let the blue light radiate less to the eyes". The terminal performs word segmentation on this text information and can obtain the following word segments: "let", "blue light", "to", "eyes", "of", "radiation", "less", and "a little". Suppose the feature of the segment "let" is 5545, the feature of "blue light" is 2313, the feature of "to" is 2212, the feature of "eyes" is 9807, the feature of "of" is 44, the feature of "radiation" is 3566, the feature of "less" is 4324, and the feature of "a little" is 333. The terminal can then determine the feature model of the text information "Let the blue light radiate less to the eyes": 5545, 2313, 2212, 9807, 44, 3566, 4324, 333.
S1103: The terminal calculates, according to the weights of the features found in the database model corresponding to each event domain, the domain probability that the text information belongs to each event domain.
Specifically, the terminal may perform the following operations for each event domain to calculate the domain probability of that event domain: the terminal searches the database model of the event domain for the weight of each feature in the feature model, and then calculates the sum of the found weights; the sum of the weights calculated by the terminal is the domain probability that the text information belongs to that event domain.
For example, suppose feature 5545 has a weight of 21% in the music domain and a weight of 79% in the setting domain; feature 2313 has a weight of 12% in the music domain and 88% in the setting domain; feature 2212 has a weight of 69% in the music domain and 31% in the setting domain; feature 9807 has a weight of 56% in the music domain and 44% in the setting domain; feature 44 has a weight of 91% in the music domain and 9% in the setting domain; feature 3566 has a weight of 56% in the music domain and 44% in the setting domain; feature 4324 has a weight of 75% in the music domain and 25% in the setting domain; and feature 333 has a weight of 12% in the music domain and 88% in the setting domain. The sum of the weights of these features in the music domain is then the domain probability that the text information belongs to the music domain, and the sum of their weights in the setting domain is the domain probability that the text information belongs to the setting domain.
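The weight-summing computation of S1101-S1103 can be sketched as follows; the word-to-feature table mirrors the example above, and the structure (tuples of music/setting weights) is an illustrative assumption rather than the actual layout of the feature database in FIG. 12 or FIG. 13.

```python
# Word segment -> feature (cf. the feature model above).
feature_of = {"let": 5545, "blue light": 2313, "to": 2212, "eyes": 9807,
              "of": 44, "radiation": 3566, "less": 4324, "a little": 333}

# Feature -> (weight in the music domain, weight in the setting domain).
weights = {
    5545: (0.21, 0.79), 2313: (0.12, 0.88), 2212: (0.69, 0.31),
    9807: (0.56, 0.44),   44: (0.91, 0.09), 3566: (0.56, 0.44),
    4324: (0.75, 0.25),  333: (0.12, 0.88),
}

segments = ["let", "blue light", "to", "eyes", "of",
            "radiation", "less", "a little"]
feature_model = [feature_of[s] for s in segments]

music   = sum(weights[f][0] for f in feature_model)
setting = sum(weights[f][1] for f in feature_model)
print(feature_model)     # [5545, 2313, 2212, 9807, 44, 3566, 4324, 333]
print(music, setting)    # the raw weight sums serve as the domain probabilities
```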
In another implementation, the terminal may maintain a feature database that includes the database models of the multiple event domains and a feature relationship model. In this case, each database model includes multiple features and the weight of each feature, where the weight of a feature is used to indicate the probability that the corresponding feature belongs to the corresponding event domain, and the feature relationship model includes multiple features and the word segment corresponding to each feature.
For example, the terminal may maintain a feature database 1301 that includes a database model 1302 of event domain 1, a database model 1303 of event domain 2, and a feature relationship model 1304. As shown in FIG. 13, the feature relationship model 1304 includes: word segment a and the corresponding feature 102; word segment b and the corresponding feature 23; word segment c and the corresponding feature 456; word segment d and the corresponding feature 78; word segment e and the corresponding feature 375; word segment f and the corresponding feature 168; and the like. The database model 1302 of event domain 1 includes: feature 102 and its 30% weight in event domain 1; feature 23 and its 15% weight in event domain 1; feature 456 and its 26% weight in event domain 1; and feature 78 and its 81% weight in event domain 1. The database model 1303 of event domain 2 includes: feature 375 and its 62% weight in event domain 2; feature 102 and its 40% weight in event domain 2; feature 168 and its 2% weight in event domain 2; and feature 456 and its 53% weight in event domain 2.
In this case, after extracting the at least one word segment, the terminal may first search the feature relationship model 1304 shown in FIG. 13 for the features corresponding to the at least one word segment to determine the feature model of the text information; then search the database model 1302 of event domain 1 for the weights, in event domain 1, of the features in the feature model, to calculate the domain probability that the text information belongs to event domain 1; and search the database model 1303 of event domain 2 for the weights, in event domain 2, of the features in the feature model, to calculate the domain probability that the text information belongs to event domain 2.
It should be noted that, when performing S404a-S404c to calculate the confidence that the text information belongs to each event domain, the terminal performs "word segmentation processing on the text information and extraction of at least one word segment"; and when performing S1101-S1103 to calculate the domain probability that the text information belongs to each event domain, the terminal also performs "word segmentation processing on the text information and extraction of at least one word segment". The terminal only needs to perform this word segmentation processing and extract the at least one word segment once, and can then use the result both when performing S404a-S404c to calculate the confidence that the text information belongs to the first event domain and when performing S1101-S1103 to calculate the domain probability that the text information belongs to the first event domain. That is, the terminal may perform only S404a and not S1101, or perform only S1101 and not S404a.
The method provided in the embodiments of this application can be applied not only to a single-round voice dialogue between the terminal and the user, but also to multiple rounds of voice dialogue between the terminal and the user. The single-round voice dialogue described in the embodiments of this application refers to a voice dialogue in which the user and the terminal adopt a question-and-answer mode.
In contrast, consider the case where, after the user inputs one piece of voice information (such as voice information a) to the terminal, the user inputs another piece of voice information (such as voice information b) before the terminal replies to the user in response to voice information a. In this case, the terminal receives voice information a and voice information b almost simultaneously and processes them simultaneously. For example, suppose the text information a corresponding to voice information a is "I am going to Cambodia", and the text information b corresponding to voice information b is "How is the weather today?". The dialogue process in which the terminal receives and processes voice information a and voice information b in this way is referred to as a multi-round voice dialogue in the embodiments of this application. In the multi-round dialogue, the terminal may convert voice information a and voice information b into text information, and perform semantic understanding on text information a and text information b.
When acquiring the prior probabilities, the terminal may consider that the voice information preceding voice information b and the voice information preceding voice information a are the same. Since the prior probability depends on the event domain corresponding to the previous voice information, the prior probabilities that text information a and text information b belong to an event domain are the same. For example, as shown in FIG. 14, the prior probability P1 that text information a and text information b belong to the music domain is 40%, the prior probability P1 that they belong to the setting domain is 30%, and the prior probability P1 that they belong to the APP domain is 30%. The domain probabilities and confidences of text information a and text information b, by contrast, are calculated separately.
The domain probability P2-a that text information a belongs to the music domain is 40%, the domain probability that it belongs to the setting domain is 20%, and the domain probability that it belongs to the APP domain is 40%. The domain probability P2-b that text information b belongs to the music domain is 20%, the domain probability that it belongs to the setting domain is 10%, and the domain probability that it belongs to the APP domain is 70%.
The confidence P3-a that text information a belongs to the music domain is 10%, the confidence that it belongs to the setting domain is 10%, and the confidence that it belongs to the APP domain is 80%. The confidence P3-b that text information b belongs to the music domain is 60%, the confidence that it belongs to the setting domain is 30%, and the confidence that it belongs to the APP domain is 10%.
Then, the dialogue engine of the music domain, the dialogue engine of the setting domain, and the dialogue engine of the APP domain can perform semantic understanding on text information a and text information b respectively to obtain semantic understanding results. Since the probability value of text information a belonging to the APP domain (30% × 40% × 80% = 9.6%) is greater than the probability value of it belonging to the music domain (1.6%) and also greater than the probability value of it belonging to the setting domain (0.6%), the terminal can output the semantic understanding result obtained by the dialogue engine of the APP domain performing semantic understanding on text information a.
Similarly, since the probability value of text information b belonging to the music domain (40% × 20% × 60% = 4.8%) is greater than the probability value of it belonging to the setting domain (0.9%) and greater than the probability value of it belonging to the APP domain (2.1%), the terminal can output the semantic understanding result obtained by the dialogue engine of the music domain performing semantic understanding on text information b.
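The multi-round example can be reproduced with the same product rule, the only difference being that both texts share the prior P1; the dictionary layout below is an illustrative assumption, with the numbers taken from the example above.

```python
P1 = {"music": 0.40, "setting": 0.30, "app": 0.30}          # shared prior

P2 = {"a": {"music": 0.40, "setting": 0.20, "app": 0.40},   # domain probs
      "b": {"music": 0.20, "setting": 0.10, "app": 0.70}}
P3 = {"a": {"music": 0.10, "setting": 0.10, "app": 0.80},   # confidences
      "b": {"music": 0.60, "setting": 0.30, "app": 0.10}}

for text in ("a", "b"):
    score = {d: P1[d] * P2[text][d] * P3[text][d] for d in P1}
    print(text, max(score, key=score.get), score)
# a -> app   (0.096 beats 0.016 and 0.006)
# b -> music (0.048 beats 0.009 and 0.021)
```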
The embodiments of this application provide a voice information processing method that can be applied not only to a single-round voice dialogue between the terminal and the user, but also to multiple rounds of voice dialogue between the terminal and the user. In either case, the method provided in the embodiments of this application can improve the accuracy of the selected event domain, and thus the accuracy of the semantic understanding result, so that the conformity between the events executed by the terminal and the events that the user's input voice information instructs the terminal to execute can be improved, improving the user experience.
The above terminal and the like include corresponding hardware structures and/or software modules for performing each function. Those skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the embodiments of this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered to be beyond the scope of the embodiments of this application.
The embodiments of this application may divide the terminal and the like into function modules according to the foregoing method examples. For example, each function module may be divided according to each function, or two or more functions may be integrated into one processing module. The integrated module can be implemented in the form of hardware or in the form of a software function module. It should be noted that the division of modules in the embodiments of this application is schematic and is only a logical function division; there may be another division manner in actual implementation.
In the case where each function module is divided according to the corresponding function, FIG. 15 shows a possible schematic structural diagram of the voice information processing apparatus in the terminal involved in the foregoing embodiments. The voice information processing apparatus 1500 includes: a receiving unit 1501, a conversion unit 1502, a first acquisition unit 1503, a second acquisition unit 1504, a third acquisition unit 1505, a calculation unit 1506, and an output unit 1507.
The receiving unit 1501 is configured to support the terminal in performing the operation of "receiving voice information" in S401 in the method embodiments, and/or other processes of the techniques described herein. The conversion unit 1502 is configured to support the terminal in performing the operation of "converting the voice information into text information" in S401 in the method embodiments, and/or other processes of the techniques described herein. The first acquisition unit 1503 is configured to support the terminal in performing S402, S1001-S1003, and S1101-S1103, and/or other processes of the techniques described herein. The second acquisition unit 1504 is configured to support the terminal in performing S403 in the method embodiments, and/or other processes of the techniques described herein. The third acquisition unit 1505 is configured to support the terminal in performing S404 and S404a-S404c in the method embodiments, and/or other processes of the techniques described herein. The calculation unit 1506 is configured to support the terminal in performing S405 in the method embodiments, and/or other processes of the techniques described herein. The output unit 1507 is configured to support the terminal in performing S406 in the method embodiments, and/or other processes of the techniques described herein.
Further, the voice information processing apparatus 1500 may further include a semantic understanding unit, configured to support the terminal in performing S406' in the method embodiments, and/or other processes of the techniques described herein. Further, the voice information processing apparatus 1500 may further include a storage unit, configured to store information such as the keyword models and database models described in the method embodiments. Further, the voice information processing apparatus 1500 may further include an execution unit, configured to support the terminal in performing S407 in the method embodiments, and/or other processes of the techniques described herein. Of course, the voice information processing apparatus 1500 includes, but is not limited to, the unit modules listed above; for example, it may further include a fourth acquisition unit 1508, configured to support the terminal in performing S402' in the method embodiments, and/or other processes of the techniques described herein.
The functions that can be implemented by the foregoing functional units include, but are not limited to, the functions corresponding to the method steps described in the foregoing examples. For detailed descriptions of the other units of the voice information processing apparatus 1500, reference may be made to the detailed descriptions of the corresponding method steps, and details are not described here again.
The above semantic understanding unit may correspond to one or more dialogue engines in the dialogue engine layer 202 shown in FIG. 2. The above conversion unit 1502 may correspond to the domain identification module 2012 shown in FIG. 2.
The functions of the first acquisition unit 1503, the second acquisition unit 1504, the third acquisition unit 1505, and the calculation unit 1506 may be integrated into the DS module 2014 shown in FIG. 2. It can be understood that the functions of the semantic understanding unit, the conversion unit 1502, the first acquisition unit 1503, the second acquisition unit 1504, the third acquisition unit 1505, the calculation unit 1506, and the like can be integrated into one processing module. The processing module may be a processor or a controller, for example, a central processing unit (CPU), a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, which can implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor may also be a combination implementing computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The above receiving unit 1501 and output unit 1507 may correspond to the VSI 2011 shown in FIG. 2; the VSI 2011 can be an interface of the terminal's processor. The above storage unit may be a storage module for storing algorithms, rules, and the like in the algorithm layer 203 shown in FIG. 2; the storage module can be a memory.
FIG. 17 shows a possible schematic structural diagram of the terminal involved in the above embodiments. The terminal 1700 includes a processing module 1701 and a storage module 1702. The storage module 1702 is configured to save the program code and data of the terminal (such as algorithms and rules). The processing module 1701 is configured to execute the program code saved by the storage module 1702 to perform the voice information processing method described in the method embodiments. The terminal 1700 may further include a communication module 1703 for supporting communication between the terminal and other network entities. The communication module 1703 can be a transceiver, a transceiver circuit, a communication interface, or the like; the storage module 1702 can be a memory.
When the processing module 1701 is a processor (such as the processor 101 shown in FIG. 1), the communication module 1703 is a radio frequency circuit (such as the radio frequency circuit 102 shown in FIG. 1), and the storage module 1702 is a memory (such as the memory 103 shown in FIG. 1), the terminal provided by the embodiments of this application may be the terminal 100 shown in FIG. 1.
The communication module 1703 may include not only the radio frequency circuit but also a Wi-Fi module and a Bluetooth module; communication modules such as the radio frequency circuit, the Wi-Fi module, and the Bluetooth module can be collectively referred to as a communication interface. The terminal in the embodiments of this application may include one or more processors and one or more memories, and the one or more processors, the one or more memories, and the communication interface may be coupled together through a bus. The embodiments of this application further provide an electronic device, which includes the voice information processing apparatus 1500 for performing the voice information processing method in the above embodiments.
The embodiments of this application further provide a computer storage medium storing computer program code; when a processor executes the computer program code, the electronic device performs the related method steps in any of FIG. 3, FIG. 4, FIG. 6, FIG. 10, and FIG. 11 to implement the voice information processing method in the above embodiments. The embodiments of this application further provide a computer program product; when the computer program product runs on an electronic device, the electronic device performs the related method steps in any of FIG. 3, FIG. 4, FIG. 6, FIG. 10, and FIG. 11 to implement the voice information processing method in the above embodiments.
The voice information processing apparatus 1500, the terminal 1700, the computer storage medium, and the computer program product provided by the embodiments of this application are all used to perform the corresponding methods provided above; therefore, for the beneficial effects that they can achieve, reference may be made to the beneficial effects of the corresponding methods provided above, and details are not described here again. Through the description of the foregoing implementations, it can be understood that the disclosed system, apparatus, and method may be implemented in other manners. The device embodiments described above are merely illustrative.
The division of the modules or units is only a logical function division; in actual implementation, there may be another division manner. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
The mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments. In addition, each functional unit in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. The computer-readable storage medium includes a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to perform all or part of the steps of the methods described in the various embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a flash memory, a removable hard disk, a read-only memory, a random access memory, a magnetic disk, or an optical disc.


Abstract

Embodiments of this application provide a voice information processing method, apparatus, and terminal, which relate to the field of computer technologies and can improve the efficiency with which a terminal executes the event corresponding to a semantic understanding result and save the network traffic consumed by semantic understanding. The specific solution includes: the terminal receives voice information and converts the voice information into text information; acquires the domain probability that the text information belongs to each of M preset event domains; acquires the prior probability that the text information belongs to each of N event domains, N ≤ M; acquires the confidence that the text information belongs to each of the N event domains; calculates, according to the domain probability, prior probability, and confidence that the text information belongs to each of the N event domains, N probability values at which the text information belongs to the N event domains respectively; and outputs a semantic understanding result obtained by semantically understanding the text information according to the event domain with the highest probability value among the N probability values.

Description

Voice information processing method, apparatus, and terminal
This application claims priority to Chinese Patent Application No. 201710931504.9, filed with the Chinese Patent Office on October 09, 2017 and entitled "Voice Information Processing Method and Apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of this application relate to the field of computer technologies, and in particular, to a voice information processing method, apparatus, and terminal.
Background
With the development of electronic technology, intelligent terminals have more and more functions. For example, a terminal can provide a voice dialogue function for the user: the terminal can receive voice information input by the user (such as "open the map application"), perform semantic understanding on the voice information, and then execute the event corresponding to the semantic understanding result (starting the map application in the terminal, such as Baidu Map).
Generally, because the processing capability of a terminal is limited, the terminal may send the received voice information to a cloud server, and the cloud server performs semantic understanding on the voice information to obtain a semantic understanding result; then, the cloud server may instruct the terminal to execute the event corresponding to the semantic understanding result.
However, the above voice recognition process requires at least two data interactions between the terminal and the cloud server, and the data interaction between the terminal and the cloud server may fail because of network faults or other reasons, so that the terminal cannot execute the event corresponding to the semantic understanding result in time. In addition, because the data volume of voice information is generally large, a large amount of network traffic is consumed.
Summary
Embodiments of this application provide a voice information processing method, apparatus, and terminal, which can save the network traffic consumed when semantic understanding is performed by a cloud server.
According to a first aspect, an embodiment of this application provides a voice information processing method, including: a terminal receives voice information and converts the voice information into text information, where M event domains are preset in the terminal; acquires the domain probability that the text information belongs to each of the M event domains, where the domain probability that the text information belongs to an event domain is used to represent the likelihood that the text information belongs to that event domain; acquires the prior probability that the text information belongs to each of N event domains, where the prior probability that the text information belongs to an event domain is used to represent the probability, determined according to the multiple semantic understandings already performed, that the text information belongs to that event domain, the N event domains are N of the M event domains, and N is less than or equal to M; acquires the confidence that the text information belongs to each of the N event domains, where the confidence that the text information belongs to an event domain is used to represent the degree of certainty that the text information belongs to that event domain; calculates, according to the domain probability, prior probability, and confidence that the text information belongs to each of the N event domains, N probability values at which the text information belongs to the N event domains respectively; and outputs a semantic understanding result obtained by semantically understanding the text information according to the event domain with the highest probability value among the N probability values. Optionally, the outputting of a semantic understanding result obtained by semantically understanding the text information according to the event domain with the highest probability value among the N probability values may be replaced by: taking the semantic understanding result obtained by semantically understanding the text information according to the event domain with the highest probability value among the N probability values as the final semantic understanding result.
Here, the prior probability that the text information belongs to an event domain is used to represent the probability, in historical data, that the text information belongs to that event domain; the domain probability that the text information belongs to an event domain is used to represent the likelihood that the text information belongs to that event domain; and the confidence that the text information belongs to an event domain is used to represent the degree of certainty that the text information belongs to that event domain. When selecting the event domain for processing the text information, the embodiments of this application refer not only to the domain probability obtained by analyzing the vocabulary included in the text information, but also to the prior probability that the text information belongs to the event domain and the confidence representing the degree of certainty that the text information belongs to the event domain; therefore, the accuracy of the selected event domain can be improved, and thus the accuracy of the semantic understanding result, so that the conformity between the events executed by the terminal and the events that the user's input voice information instructs the terminal to execute can be improved, improving the user experience.
In a possible design method, when N is less than M, the N event domains are the N event domains whose domain probabilities rank in the top N positions in descending order among the M preset event domains, N ≥ 2. Specifically, the terminal may select, from the M event domains in descending order of domain probability, the event domains whose domain probabilities rank in the top N positions.
It can be understood that after the terminal selects N event domains from the M event domains, it only needs to calculate the prior probabilities and confidences that the text information belongs to the N event domains, instead of calculating the prior probabilities and confidences that the text information belongs to all of the M event domains, which can reduce the computation amount of the terminal during voice information processing and improve computation efficiency.
In another possible design method, after the terminal acquires the domain probability that the text information belongs to each of the M event domains, the method of the embodiments of this application further includes: the terminal performs semantic understanding on the text information in the N event domains respectively to obtain N semantic understanding results.
After performing domain identification on the text information, the terminal may transmit the text information to the dialogue engine of the identified event domain, and the dialogue engine performs semantic understanding on the text information to obtain a semantic understanding result. Alternatively, this embodiment may not limit the order in which the terminal performs domain identification and semantic understanding; domain identification and semantic understanding may be performed simultaneously or substantially simultaneously, or semantic understanding may be performed before domain identification.
In another possible design method, each of the M event domains corresponds to a keyword model, and the keyword model includes multiple keywords of the corresponding event domain. Specifically, the acquiring, by the terminal, of the confidence that the text information belongs to each of the N event domains may include: the terminal performs word segmentation processing on the text information and extracts at least one word segment; acquires the distribution information of the keywords corresponding to the at least one word segment in the keyword model of each event domain; and calculates, according to the distribution information, the confidence that the text information belongs to each of the N event domains.
In another possible design method, the acquiring, by the terminal, of the domain probability that the text information belongs to each of the M event domains includes: the terminal performs word segmentation processing on the text information and extracts at least one word segment; searches the database model corresponding to each event domain for the features corresponding to the at least one word segment, where the database model includes multiple features, the weight of each feature, and the word segment corresponding to each feature, the weight is used to indicate the probability that the feature corresponding to the weight belongs to the event domain corresponding to the database model, and each event domain corresponds to one database model; and calculates, according to the weights of the features found in the database model corresponding to each event domain, the domain probability that the text information belongs to each event domain.
In the above feature database, the same word segment has the same feature in the database models of different event domains; that is, in the feature database, the feature of a word segment can uniquely identify it. However, the same word segment has different weights in different event domains.
In another possible design method, each of the M event domains corresponds to a keyword model, and the keyword model includes multiple keywords and, for each keyword, the probability that the keyword indicates that the text information belongs to the event domain corresponding to the keyword model. The acquiring, by the terminal, of the domain probability that the text information belongs to each of the M event domains includes: identifying at least one keyword from the text information; acquiring, from the keyword model corresponding to each event domain, the probabilities respectively indicated by the at least one keyword; and calculating, according to the probabilities respectively indicated by the at least one keyword, the domain probability that the text information belongs to each event domain.
Since the at least one keyword may include keywords from the keyword models of the respective event domains, and each keyword, in the keyword models of different event domains, may indicate the probability that the text information belongs to the corresponding event domain, the domain probability that the text information belongs to each event domain can be calculated according to the probabilities indicated by the keywords of the respective event domains included in the text information.
In another possible design method, the method of the embodiments of this application may further include: after outputting the semantic understanding result, the terminal performs, according to the semantic understanding result, the operation corresponding to the semantic understanding result.
According to a second aspect, an embodiment of this application provides a voice information processing apparatus, including: a receiving unit, a conversion unit, a first acquisition unit, a second acquisition unit, a third acquisition unit, a calculation unit, and an output unit. The receiving unit is configured to receive voice information. The conversion unit is configured to convert the voice information received by the receiving unit into text information; M event domains are preset in the terminal. The first acquisition unit is configured to acquire the domain probability that the text information converted by the conversion unit belongs to each of the M event domains, where the domain probability is used to represent the likelihood that the text information belongs to an event domain. The second acquisition unit is configured to acquire the prior probability that the text information converted by the conversion unit belongs to each of the N event domains, where the prior probability is used to represent the probability, determined according to the multiple semantic understandings already performed, that the text information belongs to an event domain, the N event domains are N of the M event domains, and N is less than or equal to M. The third acquisition unit is configured to acquire the confidence that the text information converted by the conversion unit belongs to each of the N event domains, where the confidence is used to represent the degree of certainty that the text information belongs to an event domain. The calculation unit is configured to calculate, according to the domain probability acquired by the first acquisition unit, the prior probability acquired by the second acquisition unit, and the confidence acquired by the third acquisition unit for each of the N event domains, N probability values at which the text information belongs to the N event domains respectively. The output unit is configured to output a semantic understanding result obtained by semantically understanding the text information according to the event domain with the highest probability value among the N probability values calculated by the calculation unit.
In a possible design method, when N is less than M, the N event domains are the N event domains whose domain probabilities rank in the top N positions in descending order among the M preset event domains, N ≥ 2.
In another possible design method, the voice information processing apparatus further includes a semantic understanding unit, configured to, after the first acquisition unit acquires the domain probability that the text information belongs to each of the M event domains, perform semantic understanding on the text information in the N event domains respectively to obtain N semantic understanding results.
In another possible design method, the voice information processing apparatus further includes a storage unit, configured to save the keyword model corresponding to each of the M event domains, where the keyword model includes multiple keywords of the corresponding event domain. The third acquisition unit is specifically configured to: perform word segmentation processing on the text information and extract at least one word segment; acquire the distribution information of the keywords corresponding to the at least one word segment in the keyword model of each event domain saved by the storage unit; and calculate, according to the distribution information, the confidence that the text information belongs to each of the N event domains.
In another possible design method, the first acquisition unit is specifically configured to: perform word segmentation processing on the text information and extract at least one word segment; search the database model corresponding to each event domain for the features corresponding to the at least one word segment, where the database model includes multiple features, the weight of each feature, and the word segment corresponding to each feature, the weight is used to indicate the probability that the feature corresponding to the weight belongs to the event domain corresponding to the database model, and each event domain corresponds to one database model; and calculate, according to the weights of the features found in the database model corresponding to each event domain, the domain probability that the text information belongs to each event domain.
In another possible design method, the voice information processing apparatus further includes a storage unit, configured to save the keyword model corresponding to each of the M event domains, where the keyword model includes multiple keywords and, for each keyword, the probability that the keyword indicates that the text information belongs to the event domain corresponding to the keyword model. The first acquisition unit is specifically configured to: identify at least one keyword from the text information; acquire, from the keyword model corresponding to each event domain, the probabilities respectively indicated by the at least one keyword; and calculate, according to the probabilities respectively indicated by the at least one keyword, the domain probability that the text information belongs to each event domain.
In another possible design method, the voice information processing apparatus further includes an execution unit, configured to, after the output unit outputs the semantic understanding result, perform, according to the semantic understanding result, the operation corresponding to the semantic understanding result.
According to a third aspect, an embodiment of this application provides a terminal, including: one or more processors; and one or more memories, where the one or more memories store one or more computer programs, the one or more computer programs include instructions, and when the instructions are executed by the one or more processors, the terminal is caused to perform the voice information processing method according to the first aspect and any one of its possible design methods.
According to a fourth aspect, an embodiment of this application provides an electronic device, which includes an apparatus for performing the voice information processing method according to the first aspect and any one of its possible design methods.
According to a fifth aspect, an embodiment of this application provides a computer program product containing instructions; when the computer program product runs on an electronic device, the electronic device is caused to perform the voice information processing method according to the first aspect and any one of its possible design methods.
According to a sixth aspect, an embodiment of this application provides a computer-readable storage medium including instructions; when the instructions run on an electronic device, the electronic device is caused to perform the voice information processing method according to the first aspect and any one of its possible design methods.
It can be understood that the apparatus according to the second aspect, the terminal according to the third aspect, the electronic device according to the fourth aspect, the computer program product according to the fifth aspect, and the computer storage medium according to the sixth aspect provided above are all used to perform the corresponding method provided above; therefore, for the beneficial effects that they can achieve, reference may be made to the beneficial effects of the corresponding method provided above, and details are not described here again.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the hardware structure of a terminal according to an embodiment of this application;
FIG. 2 is a schematic diagram of an architecture for voice information processing according to an embodiment of this application;
FIG. 3 is a first flowchart of a voice information processing method according to an embodiment of this application;
FIG. 4 is a second flowchart of a voice information processing method according to an embodiment of this application;
FIG. 5 is a schematic diagram of a history record of semantic understanding results of a voice information processing method according to an embodiment of this application;
FIG. 6 is a third flowchart of a voice information processing method according to an embodiment of this application;
FIG. 7 is a first schematic diagram of an example of a keyword database according to an embodiment of this application;
FIG. 8 is a first schematic diagram of an example of an execution process of a voice information processing method according to an embodiment of this application;
FIG. 9 is a second schematic diagram of an example of a keyword database according to an embodiment of this application;
FIG. 10 is a fourth flowchart of a voice information processing method according to an embodiment of this application;
FIG. 11 is a fifth flowchart of a voice information processing method according to an embodiment of this application;
FIG. 12 is a first schematic diagram of an example of a feature database according to an embodiment of this application;
FIG. 13 is a second schematic diagram of an example of a feature database according to an embodiment of this application;
FIG. 14 is a second schematic diagram of an example of an execution process of a voice information processing method according to an embodiment of this application;
FIG. 15 is a first schematic diagram of the structural composition of a voice information processing apparatus according to an example of this application;
FIG. 16 is a second schematic diagram of the structural composition of a voice information processing apparatus according to an example of this application;
FIG. 17 is a schematic diagram of the structural composition of a terminal according to an example of this application.
Detailed Description
In the following, the terms "first" and "second" are used for description purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include one or more of the features. In the description of the embodiments of this application, unless otherwise stated, "multiple" means two or more.
Embodiments of this application provide a voice information processing method and terminal, which can be applied to the process of a voice dialogue between a terminal and a user, and specifically to the process in which the terminal receives voice information input by the user, performs semantic understanding on the voice information, and executes the event corresponding to the semantic understanding result, for example, the process in which the user controls the terminal by voice.
In the embodiments of this application, performing semantic understanding on voice information may include: converting the voice information into text information, and then analyzing the text information to identify the event that the text information instructs the terminal to execute. For example, when the terminal receives the voice information "remind me to turn on airplane mode at 22:00" input by the user, the terminal can convert the voice information into the text information "remind me to turn on airplane mode at 22:00", and then identify that the event the text information instructs the terminal to execute is "issue a reminder to the user at 22:00 to turn on airplane mode", rather than directly "turning on airplane mode".
In the above voice recognition process, the conventional solution has the following problems: the data interaction between the terminal and the cloud server may fail because of network faults or other reasons, so that the terminal cannot execute the event corresponding to the semantic understanding result in time; and because the data volume of voice information is generally large, the conventional solution consumes a large amount of network traffic. To solve these problems of the conventional solution, in the voice information processing method provided by the embodiments of this application, the above semantic understanding may be performed by the terminal.
Generally, when performing the above semantic understanding, the terminal simply analyzes the vocabulary included in the converted text information and determines one event domain to which the text information belongs, i.e., judges the event domain to which the text information belongs; then the dialogue engine of the event domain to which the text information belongs performs semantic understanding on the text information by using the semantic understanding algorithm of that event domain, and the terminal executes the event corresponding to the semantic understanding result.
The problem is that, by simply analyzing the vocabulary included in the text information, the determined event domain may be inaccurate; when the dialogue engine of an inaccurate event domain performs semantic understanding on the text information by using the semantic understanding algorithm of that inaccurate event domain, the obtained semantic understanding result is also inaccurate. As a result, the event corresponding to the semantic understanding result executed by the terminal may differ from the event that the voice information input by the user instructs the terminal to execute, affecting the user experience.
In the embodiments of this application, to improve the accuracy of the terminal's semantic understanding of voice information while improving the efficiency with which the terminal executes the event corresponding to the semantic understanding result and saving the network traffic consumed when semantic understanding is performed by a cloud server, after converting the voice information into text information, the terminal may acquire, according to the historical data of the semantic understandings performed by the terminal, the prior probability that the text information belongs to each event domain, where the prior probability that the text information belongs to an event domain is used to represent the probability, in the historical data, that the text information belongs to that event domain. Then, the terminal analyzes the text information and acquires the domain probability that the text information belongs to each event domain, where the domain probability that the text information belongs to an event domain is used to represent the likelihood that the text information belongs to that event domain. Subsequently, the terminal may calculate the confidence that the text information belongs to each event domain, where the confidence that the text information belongs to an event domain is used to represent the degree of certainty that the text information belongs to that event domain. Next, the terminal may calculate, according to the prior probability, domain probability, and confidence that the text information belongs to an event domain, the probability value that the text information belongs to that event domain, and thereby obtain the probability value that the text information belongs to each event domain. Finally, the terminal may have the dialogue engine of the event domain with the highest probability value perform semantic understanding on the text information, take the obtained semantic understanding result as the semantic understanding result of the text information (i.e., of the above voice information), and execute the event corresponding to the semantic understanding result.
Since the embodiments of this application, when selecting the event domain for processing the text information, refer not only to the domain probability obtained by analyzing the vocabulary included in the text information, but also to the prior probability that the text information belongs to the event domain and the confidence representing the degree of certainty that the text information belongs to the event domain, the accuracy of the selected event domain can be improved, and thus the accuracy of the semantic understanding result, so that the conformity between the events executed by the terminal and the events that the user's voice information instructs the terminal to execute can be improved, improving the user experience.
其中,本申请实施例中的终端可以是允许用户通过输入语音信息指示终端执行相关操作事件的手机(如图1所示的手机100)、平板电脑、个人计算机(Personal Computer,PC)、个人数字助理(personal digital assistant,PDA)、智能手表、上网本、可穿戴电子设备等,本申请实施例对该设备的具体形式不做特殊限制。
其中,本申请实施例中文本信息所归属的事件领域是指对该文本信息进行语义理解后,语义理解结果所指示终端执行的事件所归属的领域。例如,本申请实施例中的事件领域可以包括音乐领域、设置领域、应用程序(Application,APP)领域等。举例来说,“播放歌曲a”和“播放下一曲”等文本信息归属于音乐领域,“调低屏幕亮度”和“打开飞行模式”等文本信息归属于设置领域,“打开微信”和“地图导航至A街道10号”等文本信息归属于APP领域。
如图1所示,以手机100作为上述终端举例,手机100具体可以包括:处理器101、射频(Radio Frequency,RF)电路102、存储器103、触摸屏104、蓝牙装置105、一个或多个传感器106、Wi-Fi装置107、定位装置108、音频电路109、外设接口110以及电源装置111等部件。这些部件可通过一根或多根通信总线或信号线(图1中未示出)进行通信。本领域技术人员可以理解,图1中示出的硬件结构并不构成对手机的限定,手机100可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图1对手机100的各个部件进行具体的介绍:
处理器101是手机100的控制中心,利用各种接口和线路连接手机100的各个部分,通过运行或执行存储在存储器103内的应用程序,以及调用存储在存储器103内的数据,执行手机100的各种功能和处理数据。在一些实施例中,处理器101可包括一个或多个处理单元。在本申请的一些实施例中,上述处理器101还可以包括指纹验证芯片,用于对采集到的指纹进行验证。
射频电路102可用于在收发信息或通话过程中,无线信号的接收和发送。特别地,射频电路102可以将基站的下行数据接收后,给处理器101处理;另外,将涉及上行的数据发送给基站。通常,射频电路包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器、双工器等。此外,射频电路102还可以通过无线通信和其他设备通信。所述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统、通用分组无线服务、码分多址、宽带码分多址、长期演进、电子邮件、短消息服务等。
存储器103用于存储应用程序以及数据,处理器101通过运行存储在存储器103的应用程序以及数据,执行手机100的各种功能以及数据处理。存储器103主要包括存储程序区以及存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等);存储数据区可以存储根据使用手机100时所创建的数据(比如音频数据、电话本等)。此外,存储器103可以包括高速随机存取存储器(RAM),还可以包括非易失存储器,例如磁盘存储器件、闪存器件或其他易失性固态存储器件等。存储器103可以存储各种操作系统,例如iOS操作系统、Android操作系统等。上述存储器103可以是独立的,通过上述通信总线与处理器101相连接;存储器103也可以和处理器101集成在一起。
触摸屏104具体可以包括触控板104-1和显示器104-2。
其中,触控板104-1可采集手机100的用户在其上或附近的触摸事件(比如用户使用手指、触控笔等任何适合的物体在触控板104-1上或在触控板104-1附近的操作),并将采集到的触摸信息发送给其他器件(例如处理器101)。其中,用户在触控板104-1附近的触摸事件可以称之为悬浮触控;悬浮触控可以是指,用户无需为了选择、移动或拖动目标(例如图标等)而直接接触触控板,而只需用户位于设备附近以便执行所想要的功能。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型来实现触控板104-1。
显示器(也称为显示屏)104-2可用于显示由用户输入的信息或提供给用户的信息以及手机100的各种菜单。可以采用液晶显示器、有机发光二极管等形式来配置显示器104-2。触控板104-1可以覆盖在显示器104-2之上,当触控板104-1检测到在其上或附近的触摸事件后,传送给处理器101以确定触摸事件的类型,随后处理器101可以根据触摸事件的类型在显示器104-2上提供相应的视觉输出。虽然在图1中,触控板104-1与显示屏104-2是作为两个独立的部件来实现手机100的输入和输出功能,但是在某些实施例中,可以将触控板104-1与显示屏104-2集成而实现手机100的输入和输出功能。可以理解的是,触摸屏104是由多层的材料堆叠而成,本申请实施例中只展示出了触控板(层)和显示屏(层),其他层在本申请实施例中不予记载。另外,触控板104-1可以以全面板的形式配置在手机100的正面,显示屏104-2也可以以全面板的形式配置在手机100的正面,这样在手机的正面就能够实现无边框的结构。
另外,手机100还可以具有指纹识别功能。例如,可以在手机100的背面(例如后置摄像头的下方)配置指纹识别器112,或者在手机100的正面(例如触摸屏104的下方)配置指纹识别器112。又例如,可以在触摸屏104中配置指纹采集器件112来实现指纹识别功能,即指纹采集器件112可以与触摸屏104集成在一起来实现手机100的指纹识别功能。在这种情况下,该指纹采集器件112配置在触摸屏104中,可以是触摸屏104的一部分,也可以以其他方式配置在触摸屏104中。本申请实施例中的指纹采集器件112的主要部件是指纹传感器,该指纹传感器可以采用任何类型的感测技术,包括但不限于光学式、电容式、压电式或超声波传感技术等。
手机100还可以包括蓝牙装置105,用于实现手机100与其他短距离的设备(例如手机、智能手表等)之间的数据交换。本申请实施例中的蓝牙装置可以是集成电路或者蓝牙芯片等。
手机100还可以包括至少一种传感器106,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节触摸屏104的显示器的亮度,接近传感器可在手机100移动到耳边时,关闭显示器的电源。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机100还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
无线保真(Wireless Fidelity,Wi-Fi)装置107,用于为手机100提供遵循Wi-Fi相关标准协议的网络接入,手机100可以通过Wi-Fi装置107接入到Wi-Fi接入点,进而帮助用户收发电子邮件、浏览网页和访问流媒体等,它为用户提供了无线的宽带互联网访问。在其他一些实施例中,该Wi-Fi装置107也可以作为Wi-Fi无线接入点,可以为其他设备提供Wi-Fi网络接入。
定位装置108,用于为手机100提供地理位置。可以理解的是,该定位装置108具体可以是全球定位系统(Global Positioning System,GPS)或北斗卫星导航系统、俄罗斯GLONASS等定位系统的接收器。定位装置108在接收到上述定位系统发送的地理位置后,将该信息发送给处理器101进行处理,或者发送给存储器103进行保存。在另外的一些实施例中,该定位装置108还可以是辅助全球卫星定位系统(Assisted Global Positioning System,AGPS)的接收器,AGPS系统通过作为辅助服务器来协助定位装置108完成测距和定位服务,在这种情况下,辅助定位服务器通过无线通信网络与设备例如手机100的定位装置108(即GPS接收器)通信而提供定位协助。在另外的一些实施例中,该定位装置108也可以采用基于Wi-Fi接入点的定位技术。由于每一个Wi-Fi接入点都有一个全球唯一的媒体访问控制(Media Access Control,MAC)地址,设备在开启Wi-Fi的情况下即可扫描并收集周围的Wi-Fi接入点的广播信号,因此可以获取到Wi-Fi接入点广播出来的MAC地址;设备将这些能够标示Wi-Fi接入点的数据(例如MAC地址)通过无线通信网络发送给位置服务器,由位置服务器检索出每一个Wi-Fi接入点的地理位置,并结合Wi-Fi广播信号的强弱程度,计算出该设备的地理位置并发送到该设备的定位装置108中。
音频电路109、扬声器113、麦克风114可提供用户与手机100之间的音频接口。音频电路109可将接收到的音频数据转换后的电信号,传输到扬声器113,由扬声器113转换为声音信号输出;另一方面,麦克风114将收集的声音信号转换为电信号,由音频电路109接收后转换为音频数据,再将音频数据输出至RF电路102以发送给比如另一手机,或者将音频数据输出至存储器103以便进一步处理。
外设接口110,用于为外部的输入/输出设备(例如键盘、鼠标、外接显示器、外部存储器、用户识别模块卡等)提供各种接口。例如通过通用串行总线(Universal Serial Bus,USB)接口与鼠标连接,通过用户识别模块卡卡槽上的金属触点与电信运营商提供的用户识别模块(Subscriber Identification Module,SIM)卡进行连接。外设接口110可以被用来将上述外部的输入/输出外围设备耦接到处理器101和存储器103。
在本发明实施例中,手机100可通过外设接口110与设备组内的其他设备进行通信,例如,通过外设接口110可接收其他设备发送的显示数据进行显示等,本发明实施例对此不作任何限制。
手机100还可以包括给各个部件供电的电源装置111(比如电池和电源管理芯片),电池可以通过电源管理芯片与处理器101逻辑相连,从而通过电源装置111实现管理充电、放电、以及功耗管理等功能。
尽管图1未示出,手机100还可以包括摄像头(前置摄像头和/或后置摄像头)、闪光灯、微型投影装置、近场通信(Near Field Communication,NFC)装置等,在此不再赘述。
图2为本申请实施例提供的一种用于进行语音信息处理的架构示意图,该架构位于终端中。如图2所示,该架构包括中控层201、对话引擎层202和算法层203。
其中,中控层201包括:语音服务接口(Voice Service Interface,VSI)2011、领域识别模块2012、调度分发模块2013和汇总决策(Decision Summary,DS)模块2014。
对话引擎层202中包括至少两个对话引擎。例如,如图2所示,对话引擎层202中包括对话引擎1、对话引擎2和对话引擎3。
算法层203包括:“模型和算法库”2031、规则(Rule)库2032、兴趣点(Points Of Interests,POI)库2033和状态模型2034。
中控层201用于通过VSI 2011接收语音信息(如从第三方应用接收语音信息),然后将接收到的语音信息传输至领域识别模块2012。
请参考图2和图3,领域识别模块2012用于将接收到的语音信息转换为文本信息,并对该文本信息进行初步领域识别,识别出该文本信息可能的至少两个事件领域,然后将识别结果传输至调度分发模块2013;其中,领域识别模块2012可以调度算法层203中的“模型和算法库”2031、规则(Rule)库2032、POI库2033和状态模型2034,对上述文本信息进行领域识别。
其中,“模型和算法库”2031中可以包括多个算法(也称之为模型),这多个算法用于支持领域识别模块2012和对话引擎层202中的对话引擎(如对话引擎1)对文本信息进行分析。举例来说,如图2所示,算法层203中的“模型和算法库”2031中包括:逻辑回归/支持向量机(Logistic Regression/Support Vector Machine,LR/SVM)算法、词频-逆向文件频率(Term Frequency–Inverse Document Frequency,TF-IDF)算法、N-Gram/WS(Word Segment,分词)算法、语义角色标注(Semantic Role Labeling,SRL)算法、词性标注(Part of Speech,POS)算法、命名实体识别(Named Entity Recognition,NER)算法、条件随机场(Conditional Random Field,CRF)算法、统计机器翻译(Statistical Machine Translation,SMT)算法、深度强化学习网络(Deep Reinforcement Learning Network,DRN)算法、卷积/循环神经网络(Convolution/Recurrent Neural Net,C/RNN)算法和长短期记忆网络(Long Short-Term Memory,LSTM)算法等算法。其中,N-Gram是大词汇连续语音识别中常用的一种语言模型,对中文而言,可以称之为汉语语言模型(Chinese Language Model,CLM)。该汉语语言模型可以利用语音信息的上下文中相邻词间的搭配信息,可以实现语音信息到汉字(即文本信息)的自动转换。
算法层203中的Rule库2032中可以包括归属于各个事件领域的文本信息的语义理解规则。例如,如图2所示,Rule库2032可以包括归属于APP领域的文本信息的语义理解规则、归属于设置领域的文本信息的语义理解规则和归属于音乐领域的文本信息的语义理解规则等。其中,Rule库2032中一个事件领域的语义理解规则可以用于指示对归属于该事件领域的文本信息进行语义理解时,从“模型和算法库”2031中要调用的算法。举例来说,Rule库2032中APP领域的语义理解规则可以用于指示对归属于APP领域的文本信息进行语义理解时,可以从“模型和算法库”中调用LR/SVM算法和TF-IDF算法。
以导航领域为例,POI库2033可以是包括使用Rule库2032中的规则的对象名称(如餐馆名称、学校名称等)、对象地址(如餐馆地址、学校地址等)、经纬度、类别(如学校、餐馆、政府机关、商场)等信息的数据集合。例如,POI库2033中可以包括Rule库2032中归属于音乐领域的文本信息的语义理解规则中的歌手和歌曲等。其中,POI库2033中按照不同的地址可以维护多个数据集合;或者,按照不同的类别可以维护多个数据集合。
状态模型2034是对话引擎层202中的对话引擎管理对话状态的模型,该状态模型2034可以是自定义的模型,如确定性模型,概率模型,马尔科夫模型等。状态模型2034可以在终端与用户对话过程中,提供不同对话状态之间的转移。例如,概率模型是指用户输入语音信息后,该语音信息对应的文本信息归属于导航领域的概率值大于预设值,则输出在导航领域对文本信息进行语义理解的结果。
其中,调度分发模块2013用于将上述文本信息分发给上述识别结果所指示的至少两个事件领域所对应的对话引擎(如对话引擎1),由对应的对话引擎分别对该文本信息进行自然语言理解(Natural Language Understanding,NLU)、对话管理(Dialogue Management,DM)和自然语言处理(Natural Language Processing,NLP)(即对话生成),以得到上述文本信息在对应领域的语义理解结果。
如图2所示,对话引擎层202中的每个对话引擎对应于一个事件领域。例如,对话引擎1对应于设置领域,对话引擎2对应于APP领域,对话引擎3对应于音乐领域。以对话引擎1为例,各个事件领域的对话引擎中都可以包括:NLU模块、DM模块和NLP模块,用于对文本信息进行语义理解,得到语义理解结果。其中,各个对话引擎可以调用算法层203中的模型和算法库2031、Rule库2032、POI库2033中与该对话引擎对应的模型、算法和规则等,对上述文本信息进行语义理解。
随后,各个对话引擎可以将其得到的语义理解结果传输至DS模块2014,由DS模块2014执行本申请实施例中的方法步骤,从上述多个对话引擎反馈的语义理解结果中,选择出文本信息的归属概率值最高的事件领域对应的语义理解结果(即如图3所示,最优的语义理解结果),然后将选择出的语义理解结果作为上述文本信息的语义理解结果,通过上述VSI接口反馈该语义理解结果。
可以理解,图2所示的中控层201和对话引擎层202,以及一部分算法层203的功能可以集成在图1所示的手机100的处理器101中实现,图2所示的算法层203中的算法和规则等信息可以保存在图1所示的手机100的存储器103中。即图2所示的用于进行语音信息处理的架构可以位于图1所示的手机100中。
本申请实施例提供一种语音信息处理方法,如图4所示,该语音信息处理方法包括S401-S406:
S401、终端接收语音信息,将该语音信息转换为文本信息;该终端中预设M个事件领域。
本申请实施例中,终端可以在接收到语音信息后,调用图2所示的算法层203中用于进行语音文本转换的算法(如N-Gram/WS算法),将语音信息转换为文本信息。或者,终端还可以通过调用语音文本转换(voice-to-text,speech-to-text)程序,将语音信息转换为文本信息。本申请实施例中,对将该语音信息转换为文本信息的具体方式不作限定。其中,终端接收的语音信息,一般指的是用户发出的语音;即用户发出语音,然后终端接收到该语音后,执行后续的动作。
其中,终端中可以预设M个事件领域,如音乐领域、设置领域、应用程序APP领域等。其中,M≥2。
S402、终端获取文本信息归属于M个事件领域中的每个事件领域的领域概率,该领域概率用于表征文本信息归属于一个事件领域的可能性。
其中,终端可以通过图2所示的中控层201中的领域识别模块2012,调用算法层203中用于对文本信息进行语义语法分析的算法,获取文本信息归属于M个事件领域中的各个事件领域的领域概率。其中,文本信息归属于一个事件领域的领域概率越高,该文本信息归属于该事件领域的可能性则越高。
可选的,终端可以在执行S402之后,从上述M个事件领域中获取文本信息对应的N个事件领域。具体的,如图4所示,在S402之后,本申请的方法还可以包括S402':
S402'、终端从上述M个事件领域中获取文本信息对应的N个事件领域。
需要说明的是,终端在执行S402之后,如果执行S402'从上述M个事件领域中获取文本信息对应的N个事件领域,那么本申请实施例中的N个事件领域则是上述M个事件领域中的部分事件领域,N<M。在这种情况下,上述N个事件领域是上述预设M个事件领域中,领域概率按照由高至低的顺序排列在前N位的N个事件领域,N≥2。即终端在执行S402后,可以按照领域概率由高至低的顺序,从上述M个事件领域中选择出领域概率排列在前N位的事件领域。示例性的,假设终端中预先有4个事件领域(即M=4):事件领域1、事件领域2、事件领域3和事件领域4,N=3。其中,上述文本信息归属于事件领域1的领域概率为50%,上述文本信息归属于事件领域2的领域概率为25%,上述文本信息归属于事件领域3的领域概率为10%,上述文本信息归属于事件领域4的领域概率为15%。由于50%>25%>15%>10%;因此,终端可以从上述事件领域1-事件领域4中选择出领域概率按照由高至低的顺序排列在前3位的3个事件领域,即事件领域1、事件领域2和事件领域4。
如此,终端在执行S403-S405时,只需要计算上述文本信息归属于上述N个事件领域的先验概率和置信度,而不需要计算文本信息归属于M个事件领域中所有事件领域的先验概率和置信度,可以减少终端进行语音信息处理时的计算量,提高计算效率。
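为便于理解,下面给出一段示意性的Python代码,演示“按领域概率由高至低选出前N个事件领域”的过程;其中的领域名称与概率数值沿用上文示例,函数划分仅为说明所作的假设,并非对实现方式的限定。

```python
def select_top_n(domain_probs, n):
    # 按领域概率由高至低排序,返回排在前 n 位的事件领域
    ranked = sorted(domain_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [domain for domain, _ in ranked[:n]]

# 上文示例:M=4 个事件领域的领域概率,N=3
domain_probs = {"事件领域1": 0.50, "事件领域2": 0.25,
                "事件领域3": 0.10, "事件领域4": 0.15}
print(select_top_n(domain_probs, 3))
# ['事件领域1', '事件领域2', '事件领域4']
```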
当然,在本申请实施例中S402'是可选的,终端也可以不执行S402',在这种情况下,本申请实施例中的N个事件领域即上述M个事件领域,N=M。
本申请中,无论N=M或者N<M,在S402或者S402'之后,本申请的方法都可以包括S403-S406:
S403、终端获取上述文本信息归属于所述N个事件领域中的每一个事件领域的先验概率,该先验概率用于表征根据已进行的多次语义理解,确定上述文本信息归属于一个事件领域的概率,该N个事件领域为M个事件领域中的N个事件领域,N小于或等于M。
其中,一个事件领域的先验概率用于表征根据已进行的多次语义理解,确定该文本信息归属于该事件领域的概率。终端可以根据以往进行多次语义理解的历史数据,分别获取上述文本信息归属于上述N个事件领域中的每一个事件领域的先验概率。
示例性的,本申请这里以终端获取上述文本信息归属于上述N个事件领域中的第一事件领域的先验概率为例,对终端获取上述文本信息归属于N个事件领域中的各个事件领域的先验概率的方法进行举例说明。其中,第一事件领域可以是上述N个事件领域中的任一事件领域。
在一种可能的实现方式中,终端可以统计该终端进行语义理解的总次数X;统计X次语义理解中、指示终端执行的事件归属于上述第一事件领域的语义理解结果的个数y;计算语义理解结果的个数y与语义理解的总次数X的比值y/X,该y/X是第一事件领域的先验概率。其中,终端统计的语义理解的总次数X是指终端已进行过的所有语义理解的总次数。此处的“所有语义理解”并不限定语义理解的对象,即包括终端对任意文本信息进行的语义理解。
举例来说,假设终端中预先设置了三个事件领域,如音乐领域、设置领域和APP领域。该终端进行语义理解的总次数为P(即X=P),这P次语义理解中,a个语义理解结果指示终端执行的事件归属于音乐领域(简称音乐领域的语义理解),b次语义理解结果指示终端执行的事件归属于设置领域(简称设置领域的语义理解),c次语义理解结果指示终端执行的事件归属于APP领域(简称APP领域的语义理解),a+b+c=P。那么,终端可以确定待处理的文本信息(如文本信息K)归属于音乐领域的先验概率为a/P,文本信息K归属于设置领域的先验概率为b/P,文本信息K归属于APP领域的先验概率为c/P。
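上述按比值y/X计算先验概率的过程,可以用如下示意性的Python片段表示;其中历史记录以“每次语义理解结果所归属的事件领域”的列表表示,列表内容为假设的示例数据。

```python
from collections import Counter

def prior_by_frequency(history):
    # history: 依时间顺序记录的、各次语义理解结果所归属的事件领域
    total = len(history)                 # 语义理解的总次数 X
    counts = Counter(history)            # 各事件领域的语义理解结果个数 y
    return {domain: y / total for domain, y in counts.items()}

history = ["音乐领域"] * 5 + ["设置领域"] * 3 + ["APP领域"] * 2  # 假设 P=10
print(prior_by_frequency(history))
# {'音乐领域': 0.5, '设置领域': 0.3, 'APP领域': 0.2}
```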
一般而言,终端进行的相邻两次语义理解的对象(即文本信息)归属于同一事件领域的可能性较高,或者前一次语义理解的对象归属于一个事件领域,可能会对后一次语义理解的对象归属于哪一事件领域产生影响。例如,在用户使用手机导航的场景中,手机前一次接收用户输入的语音信息可以是“导航”,随后,该手机接收到用户输入的语音信息是“去x街道100号”或者其他地点信息的可能性较高。其中,“导航”和“去x街道100号”都可以归属于上述APP领域,用于指示手机调用地图APP执行相应事件。
基于上述现象,在一种可能的实现方式中,终端获取文本信息K归属于上述N个事件领域中的每一个事件领域的先验概率时,如果前一次语义理解的对象归属于事件领域A。那么,终端则可以确定文本信息K归属于事件领域A的先验概率为a,a>0.5,文本信息K归属于其他任一事件领域的先验概率均为(1-a)/(N-1)。
示例性的,假设N=3,上述N个事件领域包括音乐领域、设置领域和APP领域;前一次语义理解的对象归属于音乐领域。那么终端则可以确定文本信息K归属于音乐领域的先验概率为0.8,文本信息K归属于设置领域的先验概率为(1-0.8)/(3-1)=0.1,文本信息K归属于APP领域的先验概率也为0.1。
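这一分配规则可以用如下示意性代码表达;其中a=0.8只是沿用上文示例的假设取值。

```python
def prior_by_previous_domain(domains, previous_domain, a=0.8):
    # 前一次语义理解归属的事件领域取先验概率 a,
    # 其余 N-1 个事件领域平分剩余概率 (1-a)/(N-1)
    rest = (1 - a) / (len(domains) - 1)
    return {d: (a if d == previous_domain else rest) for d in domains}

print(prior_by_previous_domain(["音乐领域", "设置领域", "APP领域"], "音乐领域"))
# 音乐领域为 0.8,设置领域与APP领域各约为 0.1
```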
基于上述现象,在另一种可能的实现方式中,终端在计算文本信息K归属于第一事件领域(如事件领域P)的先验概率时,可以参考终端前一次进行语义理解的文本信息所归属的事件领域(记为事件领域Q),然后统计该终端已进行过的所有语义理解中,相邻两次的被语义理解的文本信息依次归属于事件领域Q和事件领域P的概率(即按照时间先后顺序,事件领域Q在前,事件领域P在后),并将该概率确定为该文本信息K归属于第一事件领域(如事件领域P)的先验概率。
举例来说,假设终端中预先设置了三个事件领域,如音乐领域、设置领域和APP领域,该终端共进行过Y次语义理解。并且,如图5所示,这Y次语义理解的语义理解结果指示终端执行的事件所归属的事件领域依次为:设置领域、APP领域、设置领域、音乐领域、音乐领域、设置领域、APP领域、设置领域……设置领域。
假设本次语义理解(对文本信息K进行语义理解)的前一次语义理解是设置领域的语义理解。如图5所示,终端统计上述Y次语义理解中,设置领域的语义理解的次数P。终端获取这P次设置领域的语义理解中,每次设置领域相邻的后一次语义理解对应的事件领域;统计出前一次语义理解是设置领域、相邻的后一次语义理解是设置领域的次数为a,前一次语义理解是设置领域、相邻的后一次语义理解是音乐领域的次数为b,前一次语义理解是设置领域、相邻的后一次语义理解是APP领域的次数为c,a+b+c=P。那么,终端可以确定文本信息K归属于设置领域的先验概率为a/P,文本信息K归属于音乐领域的先验概率为b/P,文本信息K归属于APP领域的先验概率为c/P。其中,本实施例中出现的前一次、后一次,指的是根据时间的先后顺序,先发生的为前一次,后发生的为后一次。
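按这种“相邻两次”的统计方式,先验概率即历史序列中的领域转移频率。下面的示意性片段演示该统计过程,历史序列为仿照图5构造的假设数据。

```python
from collections import Counter

def prior_by_transition(history, previous_domain):
    # 统计历史序列中,前一次为 previous_domain 时,
    # 相邻后一次语义理解落在各事件领域的频率
    followers = [history[i + 1] for i in range(len(history) - 1)
                 if history[i] == previous_domain]          # 共 P 次
    if not followers:
        return {}
    counts = Counter(followers)                             # 对应上文的 a、b、c
    return {domain: n / len(followers) for domain, n in counts.items()}

history = ["设置领域", "APP领域", "设置领域", "音乐领域",
           "音乐领域", "设置领域", "APP领域", "设置领域"]
print(prior_by_transition(history, "设置领域"))
# 例如 {'APP领域': 0.67, '音乐领域': 0.33}(随历史数据而变)
```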
需要说明的是,本申请实施例中,终端获取文本信息归属于上述N个事件领域中的每个事件领域的先验概率的方法包括但不限于上述方法,终端获取文本信息归属于每个事件领域的先验概率的其他方法,本申请实施例这里不再赘述。
S404、终端获取上述文本信息归属于上述N个事件领域中的每个事件领域的置信度,该置信度用于表征上述文本信息归属于一个事件领域的确信程度。
本申请实施例中,终端可以针对上述M个事件领域中的每一个事件领域,保存一个关键字模型,每一个事件领域的关键字模型中包括该事件领域的多个关键字,该多个关键字是该事件领域中常用的词语和短句等。终端可以对上述文本信息进行分词处理,并提取至少一个分词,然后根据该至少一个分词对应的关键字在上述多个事件领域的关键字模型中的分布情况,计算该文本信息归属于该预设多个事件领域中每个事件领域的置信度。具体的,如图6所示,图4所示的S404可以包括S404a-S404c:
S404a、终端对上述文本信息进行分词处理,并提取至少一个分词。
其中,终端可以通过图2所示的中控层201中的领域识别模块2012,调用算法层203对文本信息进行分词处理,并提取至少一个分词。例如,假设上述文本信息为“播放歌手A的歌曲B”,终端可以对该文本信息进行分词处理,并提取出如下分词:“播放”、“歌手A”、“歌曲B”。假设终端在播放歌手A的歌曲的过程中,接收到文本信息“帮我调低音量”,终端可以对该文本信息进行分词处理,并提取出如下分词:“帮”、“我”、“调低”和“音量”。
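分词处理本身可以借助任一中文分词工具实现,下面以开源分词库jieba为例给出示意;采用哪种分词算法由具体实现决定,本文并未限定。

```python
import jieba  # 开源中文分词库,此处仅作示例

tokens = jieba.lcut("帮我调低音量")
print(tokens)  # 例如 ['帮', '我', '调低', '音量']
```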
S404b、终端获取上述至少一个分词对应的关键字在上述每个事件领域的关键字模型中的分布信息。
例如,本申请实施例的终端中可以维护一个如图7所示的关键字数据库701,该关键字数据库701中可以包括多个事件领域的关键字模型。假设终端中预先设置了两个事件领域,如音乐领域和设置领域。如图7所示,关键字数据库701中包括音乐领域的关键字模型702和设置领域的关键字模型703。其中,音乐领域的关键字模型702中包括音乐领域的多个关键字,如播放、下一曲、歌手、摇滚和歌曲等。设置领域的关键字模型703中包括设置领域的多个关键字,如飞行模式、蓝牙、亮度、音量和调低等。
S404c、终端根据所述分布信息,计算所述文本信息归属于所述N个事件领域中的每个事件领域的置信度。
其中,文本信息归属于第一事件领域的置信度用于表征该文本信息归属于该第一事件领域的确信程度。
例如,假设文本信息1为“播放歌手A的歌曲B”,终端提取的至少一个分词为“播放”、“歌手A”和“歌曲B”。终端可以确定出分词“播放”对应的关键字播放、分词“歌手A”对应的关键字“歌手”和分词“歌曲B”对应的关键字“歌曲”,均包含在音乐领域的关键字模型702中。即终端可以确定文本信息1的所有分词对应的关键字都分布在音乐领域的关键字模型702中。在这种情况下,终端可以确定文本信息1归属于音乐领域的置信度为90%,文本信息1归属于设置领域的置信度为10%。
再例如,假设终端中预先设置了三个事件领域,如音乐领域、设置领域和APP领域。终端对文本信息2进行分词处理得到至少一个分词。当该至少一个分词对应的关键字都分布在设置领域的关键字模型中时,终端可以确定文本信息2归属于设置领域的置信度为80%,文本信息2归属于音乐领域的置信度为10%,文本信息2归属于APP领域的置信度为10%。
又例如,假设终端中预先设置了三个事件领域,如音乐领域、设置领域和APP领域。终端对文本信息3进行分词处理得到8个分词。当这8个分词中的5个分词对应的关键字分布在设置领域的关键字模型中,2个分词对应的关键字分布在音乐领域的关键字模型中,1个分词对应的关键字分布在APP领域的关键字模型中时,终端可以确定文本信息3归属于设置领域的置信度为5/8=62.5%,文本信息3归属于音乐领域的置信度为25%,文本信息3归属于APP领域的置信度为12.5%。
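上述按命中比例计算置信度的做法,可以用如下示意性代码表达;关键字模型、分词结果以及“无命中时退化为均匀分布”的处理均为举例所作的假设(上文示例中还对结果做了平滑,此处从略)。

```python
def confidence_by_keywords(tokens, keyword_models):
    # keyword_models: {事件领域: 该领域关键字模型中的关键字集合}
    hits = {domain: sum(1 for t in tokens if t in kws)
            for domain, kws in keyword_models.items()}
    total = sum(hits.values())
    if total == 0:
        # 无任何命中时退化为均匀分布(一种可能的处理方式)
        return {d: 1 / len(keyword_models) for d in keyword_models}
    return {domain: n / total for domain, n in hits.items()}

keyword_models = {
    "音乐领域": {"播放", "下一曲", "歌手", "歌曲", "摇滚"},
    "设置领域": {"飞行模式", "蓝牙", "亮度", "音量", "调低"},
}
print(confidence_by_keywords(["帮", "我", "调低", "音量"], keyword_models))
# {'音乐领域': 0.0, '设置领域': 1.0}
```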
需要说明的是,当M=N时,本申请实施例对终端执行S402、S403和S404的先后顺序不作限制。例如,终端可以先执行S403,再执行S404,最后执行S402;或者,终端可以先执行S404,再执行S402,最后执行S403;或者,终端可以基本同时执行S402、S403和S404。
当N<M时,在S402之后,本申请的方法还可以包括S402'。在这种情况下,终端可以先执行S402,再执行S402',最后执行S403和S404。本申请实施例对终端执行S403和S404的先后顺序不作限制。例如,终端可以先执行S403,再执行S404;或者,终端可以先执行S404,再执行S403;或者,终端可以基本同时执行S403和S404。
S405、终端根据上述文本信息归属于上述N个事件领域中的每个事件领域的领域概率、先验概率和置信度,计算上述文本信息分别归属于所述N个事件领域的N个概率值。
其中,终端可以计算文本信息归属于第一事件领域的先验概率、领域概率和置信度的乘积,将计算得到的乘积确定为该文本信息归属于第一事件领域的概率值。
示例性的,如图8所示,假设文本信息a归属于音乐领域的先验概率为40%,文本信息a归属于设置领域的先验概率为30%,文本信息a归属于APP领域的先验概率为30%;文本信息a归属于音乐领域的领域概率为40%,文本信息a归属于设置领域的领域概率为20%,文本信息a归属于APP领域的领域概率为40%;文本信息a归属于音乐领域的置信度为10%,文本信息a归属于设置领域的置信度为10%,文本信息a归属于APP领域的置信度为80%。终端可以计算得到文本信息a归属于音乐领域的概率值为40%×40%×10%=1.6%,文本信息a归属于设置领域的概率值为30%×20%×10%=0.6%,文本信息a归属于APP领域的概率值为30%×40%×80%=9.6%。
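上述“三个量逐领域相乘、再取最大值”的计算,可以用如下片段示意,数值即上文示例中的数据;函数名为说明所作的假设。

```python
def fuse(prior, domain_prob, confidence):
    # 逐领域计算:先验概率 × 领域概率 × 置信度
    return {d: prior[d] * domain_prob[d] * confidence[d] for d in prior}

prior       = {"音乐领域": 0.40, "设置领域": 0.30, "APP领域": 0.30}
domain_prob = {"音乐领域": 0.40, "设置领域": 0.20, "APP领域": 0.40}
confidence  = {"音乐领域": 0.10, "设置领域": 0.10, "APP领域": 0.80}

scores = fuse(prior, domain_prob, confidence)
print(scores)                       # 约 {音乐: 0.016, 设置: 0.006, APP: 0.096}
print(max(scores, key=scores.get))  # APP领域
```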
S406、终端输出根据所述N个概率值中概率值最高的事件领域对文本信息进行语义理解的语义理解结果。
可选的,所述S406可以被替换为:终端将根据所述N个概率值中概率值最高的事件领域对文本信息进行语义理解的语义理解结果作为最终的语义理解结果。
对该文本信息对应的每个事件领域均执行上述步骤S402-S405之后,可以获取该文本信息归属于每个事件领域的概率值,即获取到多个概率值。然后终端获取最高的一个概率值对应的事件领域,即将概率值最高的事件领域识别为该文本信息对应的事件领域。其中,终端在对文本信息进行领域识别后,可以将文本信息传输至识别到的事件领域的对话引擎,由对话引擎对该文本信息进行语义理解,得到语义理解结果。或者,本实施例可以不限定终端进行领域识别和语义理解的顺序,可以同时或基本同时进行领域识别和语义理解,也可以先进行语义理解后进行领域识别。
如图8所示,音乐领域的对话引擎、设置领域的对话引擎和APP领域的对话引擎可以分别对文本信息a进行语义理解,得到语义理解结果。在进行领域识别之后,即获知到文本信息a归属于APP领域的概率值9.6%大于文本信息a归属于音乐领域的概率值1.6%,并且文本信息a归属于APP领域的概率值9.6%大于文本信息a归属于设置领域的概率值0.6%之后;终端可以输出APP领域的对话引擎对文本信息a进行语义理解得到的语义理解结果。
例如,终端在执行S402之后,S406之前,可以在N个事件领域,分别对上述文本信息进行语义理解,得到N个语义理解结果。具体的,在S402之后,S406之前,本申请实施例的方法还可以包括S406':
S406'、终端在N个事件领域,分别对上述文本信息进行语义理解,得到N个语义理解结果。
其中,终端在N个事件领域中的每个事件领域,分别对上述文本信息进行语义理解,得到N个语义理解结果的方法,可以参考本申请上述实施例中的相关描述,本申请实施例这里不再赘述。
进一步的,终端在输出上述语义理解结果后,还可以根据该语义理解结果,执行所述语义理解结果对应的操作。具体的,在上述S406之后,本申请实施例的方法还可以包括S407:
S407、终端输出上述语义理解结果之后,根据上述语义理解结果,执行上述语义理解结果对应的操作。
需要说明的是,本实施例中终端将根据所述概率值最高的事件领域对文本信息进行语义理解的语义理解结果作为最终识别的语义理解结果。在确定出最终的语义理解结果之后,所述终端可以向所述终端内部输出该最终的结果,使得所述终端执行该最终结果对应的操作。可以理解的,所述的向所述终端内部输出,可以是终端确定概率值最高的最终结果的过程,也可以是终端向内部的其它部件(硬件或软件)发送最终结果,使得该最终结果对应的操作被所述终端执行。可选的,在确定出最终的语义理解结果之后,所述终端也可以向所述终端的外部输出该最终的语义理解结果,例如所述终端可以向其它终端发送该最终的结果,使得其它终端获知该最终结果,或者使得其它终端来执行该最终结果对应的动作。可选的,所述终端可以既执行该最终结果对应的操作,也把该最终结果向外部输出。
本申请实施例提供一种语音信息处理方法,可以在将语音信息转换为文本信息后,根据终端进行语义理解的历史数据,获取该文本信息归属于每个事件领域的先验概率;分析该文本信息,获取所述文本信息归属于每个事件领域的领域概率;并且终端可以计算文本信息归属于每个事件领域的置信度;然后,终端根据上述文本信息归属于一个事件领域的先验概率、领域概率和置信度,计算该文本信息归属于该事件领域的概率值;最后,终端可以将概率值最高的事件领域的对话引擎对该文本信息进行语义理解,得到的语义理解结果作为该文本信息(即上述语音信息)的语义理解结果。
其中,文本信息归属于一个事件领域的先验概率:用于表征历史数据中,文本信息归属于该事件领域的概率;文本信息归属于一个事件领域的领域概率:用于表征该文本信息归属于该事件领域的可能性;文本信息归属于一个事件领域的置信度:用于表征该文本信息归属于该事件领域的确信程度。本申请实施例在选择处理文本信息的事件领域时,不仅参考了对文本信息中包括的词汇进行分析得到的领域概率,还参考了文本信息归属于事件领域的先验概率,以及用于表征该文本信息归属于该事件领域的确信程度的置信度;因此,可以提高选择的事件领域的准确性,进而可以提高语义理解结果的准确性,从而可以提高终端执行事件与用户输入的语音信息指示终端执行的事件的符合度,可以提高用户体验。
可选的,在一种可能的实现方式中,上述关键字模型中不仅可以包括多个关键字,还可以包括每个关键字指示文本信息归属于对应事件领域的概率。例如,如图9所示,关键字数据库901中包括音乐领域的关键字模型902和设置领域的关键字模型903。其中,音乐领域的关键字模型902中还可以包括:关键字“下一曲”指示文本信息归属于音乐领域的概率“概率a”、关键字“播放”指示文本信息归属于音乐领域的概率“概率b”、关键字“歌手”指示文本信息归属于音乐领域的概率“概率c”、关键字“摇滚”指示文本信息归属于音乐领域的概率“概率d”和关键字“歌曲”指示文本信息归属于音乐领域的概率“概率e”等。设置领域的关键字模型903中还可以包括:关键字“飞行模式”指示文本信息归属于设置领域的概率“概率1”、关键字“蓝牙”指示文本信息归属于设置领域的概率“概率2”、关键字“音量”指示文本信息归属于设置领域的概率“概率3”和关键字“调低”指示文本信息归属于设置领域的概率“概率4”等。
可以想到的是,本申请实施例中的关键字数据库,如关键字数据库701和关键字数据库901,可以保存在终端中。或者,为了减少关键字数据库对终端内存的占用,该关键字数据库也可以保存在云服务器中。终端可以从云服务器保存的关键字数据库中查找对应的关键字以及关键字所指示的概率。
其中,终端可以从上述文本信息中识别至少一个关键字;然后,根据至少一个关键字指示的概率,计算文本信息归属于每个事件领域的领域概率。具体的,上述S402可以包括S1001-S1003。例如,如图10所示,图4中的S402可以替换为S1001-S1003:
S1001、终端从所述文本信息中识别至少一个关键字。
其中,终端可以针对每一个事件领域,识别文本信息中是否包括该事件领域的关键字模型中的关键字。例如,假设终端中预先设置了两个事件领域,如音乐领域和设置领域,文本信息4为“播放下一曲时,调低歌曲的音量”。终端可以识别到该文本信息4中包括关键字“播放”、“下一曲”、“调低”、“歌曲”和“音量”。其中,“播放”、“下一曲”和“歌曲”是音乐领域的关键字模型中的关键字,“调低”和“音量”是设置领域的关键字模型中的关键字。
S1002、终端从上述每个事件领域对应的关键字模型中获取上述至少一个关键字分别指示的概率。
示例性的,如图9所示,关键字“播放”指示文本信息归属于音乐领域的概率为概率b,关键字“下一曲”指示文本信息归属于音乐领域的概率为概率a,关键字“歌曲”指示文本信息归属于音乐领域的概率为概率e。如图9所示,关键字“调低”指示文本信息归属于设置领域的概率为概率4,关键字“音量”指示文本信息归属于设置领域的概率为概率3。
S1003、终端根据上述至少一个关键字分别指示的概率,计算上述文本信息归属于上述每个事件领域的领域概率。
例如,上述文本信息4归属于音乐领域的领域概率可以为概率b、概率a与概率e之和;上述文本信息4归属于设置领域的领域概率可以为概率4与概率3之和。
可选的,本申请实施例中,终端还可以对上述至少一个关键字指示的概率进行归一化,以计算得到文本信息归属于每个事件领域的领域概率。例如,上述文本信息4归属于音乐领域的领域概率可以为(概率b+概率a+概率e)/3;上述文本信息4归属于设置领域的领域概率可以为(概率4+概率3)/2。
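这两种计算方式(求和或取平均)可以用如下示意性代码统一表达;各关键字所指示的概率(即上文的概率a、概率b等)此处以假设的数值代替。

```python
def domain_prob_by_keywords(text, keyword_prob_models, normalize=True):
    # keyword_prob_models: {事件领域: {关键字: 该关键字指示的概率}}
    result = {}
    for domain, model in keyword_prob_models.items():
        probs = [p for kw, p in model.items() if kw in text]
        if not probs:
            result[domain] = 0.0
        else:
            result[domain] = sum(probs) / (len(probs) if normalize else 1)
    return result

models = {
    "音乐领域": {"播放": 0.7, "下一曲": 0.9, "歌曲": 0.8},  # 假设的概率b、a、e
    "设置领域": {"调低": 0.6, "音量": 0.8},                 # 假设的概率4、3
}
print(domain_prob_by_keywords("播放下一曲时,调低歌曲的音量", models))
# 音乐领域: (0.7+0.9+0.8)/3 ≈ 0.8,设置领域: (0.6+0.8)/2 = 0.7
```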
本申请实施例中,终端可以从文本信息中识别至少一个关键字,然后根据至少一个关键字指示的概率,计算文本信息归属于每个事件领域的领域概率。其中,由于至少一个关键字中可能包括各个事件领域的关键字模型中的关键字,而每个关键字在不同的事件领域的关键字模型中,可以指示上述文本信息归属于对应事件领域的概率;因此,根据文本信息中包括的各个事件领域的关键字所指示的概率,可以计算得到文本信息归属于各个事件领域的领域概率。
可选的,在另一种可能的实现方式中,终端可以维护一个特征数据库,该特征数据库中包括上述多个事件领域的数据库模型。每个数据库模型中包括多个特征和每个特征的权重及其对应的分词,该特征的权重用于指示对应特征归属于对应事件领域的概率。终端可以针对任一事件领域,执行以下操作以计算文本信息归属于该事件领域的领域概率:对文本信息进行分词处理提取到至少一个分词,然后从该事件领域的数据库模型中查找至少一个分词对应的特征,再根据查找到的特征的权重,计算该文本信息归属于该事件领域的领域概率。具体的,上述S402可以包括S1101-S1103。例如,如图11所示,图4中的S402可以替换为S1101-S1103:
S1101、终端对文本信息进行分词处理,并提取至少一个分词。
其中,终端对文本信息进行分词处理,并提取至少一个分词的方法可以参考上述实施例中S404a中的详细描述,本申请实施例这里不再赘述。
S1102、终端从上述每个事件领域对应的数据库模型中查找上述至少一个分词对应的特征,该数据库模型中包括多个特征、每个特征的权重及每个特征对应的分词,该权重用于指示该权重对应的特征归属于上述数据库模型中对应的事件领域的概率;其中,每个事件领域对应一个数据库模型。
其中,终端可以统计上述每个事件领域中出现过的多个分词,然后为每个分词分配一个可以唯一标识该分词的特征。例如,终端可以为每个分词分配一个唯一标识该分词的数字,该数字可以是十进制的数字,也可以是二进制的数字,或者该数字还可以是其他格式的数字,本申请实施例对数字的格式不做限制。然后,终端可以根据历史语义理解结果中,上述各个分词归属于各个事件领域的概率值,确定出每个分词对应的特征归属于各个事件领域的概率。
例如,如图12所示,终端可以维护特征数据库1201,该特征数据库1201中包括事件领域1的数据库模型1202和事件领域2的数据库模型1203等。其中,事件领域1的数据库模型1202中包括:特征102、特征102对应的分词a和特征102在事件领域1的权重30%;特征23、特征23对应的分词b和特征23在事件领域1的权重15%;特征456、特征456对应的分词c和特征456在事件领域1的权重26%;特征78、特征78对应的分词d和特征78在事件领域1的权重81%。事件领域2的数据库模型1203中包括:特征375、特征375对应的分词e和特征375在事件领域2的权重62%;特征102、特征102对应的分词a和特征102在事件领域2的权重40%;特征168、特征168对应的分词f和特征168在事件领域2的权重2%;特征456、特征456对应的分词c和特征456在事件领域2的权重53%。
需要说明的是,在上述特征数据库中,同一分词在不同事件领域的数据库模型中的特征相同,即在特征数据库中,分词的特征可以唯一标识该分词。但是,同一分词在不同事件领域中的权重不同。
例如,如图12所示,在事件领域1的数据库模型1202和事件领域2的数据库模型1203中,分词a的特征均为102,分词c的特征均为456。而特征102在事件领域1的权重为30%,在事件领域2的权重为40%;特征456在事件领域1的权重为26%,在事件领域2的权重为53%。
示例性的,以上述文本信息为“让蓝光对眼睛的辐射少一点”为例,终端对该文本信息进行分词处理,可以得到以下分词:“让”、“蓝光”、“对”、“眼睛”、“的”、“辐射”、“少”、“一点”。假设分词“让”的特征为5545、分词“蓝光”的特征为2313、分词“对”的特征为2212、分词“眼睛”的特征为9807、分词“的”的特征为44、分词“辐射”的特征为3566、分词“少”的特征为4324、分词“一点”的特征为333。终端可以确定文本信息“让蓝光对眼睛的辐射少一点”的特征模型为:5545,2313,2212,9807,44,3566,4324,333。
S1103、终端根据从所述每个事件领域对应的数据库模型中查找到的特征的权重,计算所述文本信息归属于所述每个事件领域的领域概率。
其中,终端可以针对每个事件领域执行以下操作,计算对应事件领域的领域概率:终端从任一事件领域的数据库模型中,查找上述特征模型中的每个特征在该事件领域中的权重;终端计算查找到的权重的平均值(即权重之和除以特征个数)。其中,终端计算得到的平均值即为文本信息归属于该事件领域的领域概率。
例如,假设终端中设置了两个事件领域(如音乐领域和设置领域)。上述特征模型“5545,2313,2212,9807,44,3566,4324,333”中,特征5545在音乐领域的权重为21%,在设置领域的权重为79%;特征2313在音乐领域的权重为12%,在设置领域的权重为88%;特征2212在音乐领域的权重为69%,在设置领域的权重为31%;特征9807在音乐领域的权重为56%,在设置领域的权重为44%;特征44在音乐领域的权重为23%,在设置领域的权重为77%;特征3566在音乐领域的权重为56%,在设置领域的权重为44%;特征4324在音乐领域的权重为75%,在设置领域的权重为25%;特征333在音乐领域的权重为12%,在设置领域的权重为88%。
那么,终端可以计算得到:文本信息归属于音乐领域的领域概率为(21%+12%+69%+56%+23%+56%+75%+12%)/8=40.5%,文本信息归属于设置领域的领域概率为(79%+88%+31%+44%+77%+44%+25%+88%)/8=59.5%。
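上述“查特征、取权重、求平均”的过程可以用如下示意性代码表示;特征编号与权重沿用上文示例的假设数据,为简洁起见只列出前两个分词。

```python
feature_ids = {"让": 5545, "蓝光": 2313}       # 特征关系:分词 -> 唯一特征编号
weights = {                                     # 数据库模型:特征 -> 各事件领域的权重
    5545: {"音乐领域": 0.21, "设置领域": 0.79},
    2313: {"音乐领域": 0.12, "设置领域": 0.88},
}

def domain_prob_by_features(tokens, domains):
    # 对各分词对应特征在每个事件领域的权重取平均,作为领域概率
    feats = [feature_ids[t] for t in tokens if t in feature_ids]
    return {d: sum(weights[f][d] for f in feats) / len(feats) for d in domains}

print(domain_prob_by_features(["让", "蓝光"], ["音乐领域", "设置领域"]))
# 音乐领域: (0.21+0.12)/2 = 0.165,设置领域: (0.79+0.88)/2 = 0.835
```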
可选的,在另一种可能的实现方式中,终端可以维护一个特征数据库,该特征数据库中包括上述多个事件领域的数据库模型和一个特征关系模型。每个数据库模型中包括多个特征和每个特征的权重,该特征的权重用于指示对应特征归属于对应事件领域的概率。该特征关系模型中包括多个特征以及每个特征对应的分词。
例如,如图13所示,终端可以维护一个特征数据库1301,该特征数据库1301中包括事件领域1的数据库模型1302、事件领域2的数据库模型1303和特征关系模型1304。其中,特征关系模型1304中包括:分词a和分词a对应的特征102;分词b和分词b对应的特征23;分词c和分词c对应的特征456;分词d和分词d对应的特征78;分词e和分词e对应的特征375;……;分词f和分词f对应的特征168等。事件领域1的数据库模型1302中包括:特征102和特征102在事件领域1的权重30%;特征23和特征23在事件领域1的权重15%;特征456和特征456在事件领域1的权重26%;特征78和特征78在事件领域1的权重81%。事件领域2的数据库模型1303中包括:特征375和特征375在事件领域2的权重62%;特征102和特征102在事件领域2的权重40%;特征168和特征168在事件领域2的权重2%;特征456和特征456在事件领域2的权重53%。
其中,终端在执行S1102得到至少一个分词后,可以先从图13所示的特征关系模型1304中,查找该至少一个分词对应的特征,确定出文本信息的特征模型;然后,从事件领域1的数据库模型1302中,查找该特征模型中的特征在上述事件领域1的权重,以计算文本信息归属于事件领域1的领域概率;从事件领域2的数据库模型1303中,查找该特征模型中的特征在上述事件领域2的权重,以计算文本信息归属于事件领域2的领域概率。
需要说明的是,在执行S404a-S404c计算文本信息归属于每个事件领域的置信度时,终端执行了“对文本信息进行分词处理,并提取至少一个分词”;而在执行S1101-S1103计算文本信息归属于每个事件领域的领域概率时,终端还执行了“对文本信息进行分词处理,并提取至少一个分词”。为了避免终端重复执行以下操作“对文本信息进行分词处理,并提取至少一个分词”,在终端执行S404a-S404c计算文本信息归属于该第一事件领域的置信度,执行S1101-S1103计算文本信息归属于第一事件领域的领域概率时,终端可以只执行S404a,不执行S1101;或者该终端可以只执行S1101,不执行S404a。
本申请实施例中提供的方法,不仅可以应用于终端与用户进行的单轮语音对话过程中,还可以应用于终端与用户进行的多轮语音对话的过程中。其中,本申请实施例中所述的单轮语音对话是指用户与终端采用一问一答的模式进行语音对话。但是,在一些场景中,当用户向终端输入一个语音信息(如语音信息a)后,在该终端还未响应该语音信息a回复用户时,该用户再次输入了另一语音信息(如语音信息b)。此时,由于终端几乎同时接收到语音信息a和语音信息b,因此该终端要同时处理该语音信息a和语音信息b。例如,上述语音信息a对应的文本信息a为“我要去西藏”,上述语音信息b对应的文本信息b为“今天天气怎么样”。
其中,本申请实施例将终端接收并处理上述语音信息a和语音信息b的对话过程称为多轮语音对话。本申请实施例中,终端可以将上述语音信息a和语音信息b转换为文本信息,并对文本信息a和文本信息b进行语义理解。
示例性的,由于终端在接收语音信息b时,可能还未对文本信息a进行语义理解,由此,终端可以认为语音信息b的前一个语音信息和语音信息a的前一个语音信息是相同的。由于先验概率取决于前一个语音信息对应的事件领域,因此文本信息a和文本信息b归属于某一事件领域的先验概率相同。例如,如图14所示,文本信息a和文本信息b归属于音乐领域的先验概率P1均为40%,文本信息a和文本信息b归属于设置领域的先验概率P1均为30%,文本信息a和文本信息b归属于APP领域的先验概率P1均为30%。对于领域概率和置信度,可以分别对文本信息a和文本信息b进行计算。文本信息a归属于音乐领域的领域概率P2-a为40%,文本信息a归属于设置领域的领域概率P2-a为20%,文本信息a归属于APP领域的领域概率P2-a为40%;文本信息b归属于音乐领域的领域概率P2-b为20%,文本信息b归属于设置领域的领域概率P2-b为10%,文本信息b归属于APP领域的领域概率P2-b为70%。文本信息a归属于音乐领域的置信度P3-a为10%,文本信息a归属于设置领域的置信度P3-a为10%,文本信息a归属于APP领域的置信度P3-a为80%;文本信息b归属于音乐领域的置信度P3-b为60%,文本信息b归属于设置领域的置信度P3-b为30%,文本信息b归属于APP领域的置信度P3-b为10%。
其中,如图14所示,终端可以同时或基本同时计算得到:文本信息a归属于音乐领域的概率值P-a为40%×40%×10%=1.6%,文本信息a归属于设置领域的概率值P-a为30%×20%×10%=0.6%,文本信息a归属于APP领域的概率值P-a为30%×40%×80%=9.6%;文本信息b归属于音乐领域的概率值P-b为40%×20%×60%=4.8%,文本信息b归属于设置领域的概率值P-b为30%×10%×30%=0.9%,文本信息b归属于APP领域的概率值P-b为30%×70%×10%=2.1%。
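多轮场景下的计算可以用如下片段示意:两条文本共用同一组先验概率,领域概率与置信度则分别计算;数值沿用上文示例,函数名为说明所作的假设。

```python
def fuse(prior, domain_prob, confidence):
    # 逐领域计算:先验概率 × 领域概率 × 置信度
    return {d: prior[d] * domain_prob[d] * confidence[d] for d in prior}

prior = {"音乐领域": 0.40, "设置领域": 0.30, "APP领域": 0.30}  # 文本a、b共用

score_a = fuse(prior,
               {"音乐领域": 0.40, "设置领域": 0.20, "APP领域": 0.40},
               {"音乐领域": 0.10, "设置领域": 0.10, "APP领域": 0.80})
score_b = fuse(prior,
               {"音乐领域": 0.20, "设置领域": 0.10, "APP领域": 0.70},
               {"音乐领域": 0.60, "设置领域": 0.30, "APP领域": 0.10})

print(max(score_a, key=score_a.get))  # APP领域,概率值约 9.6%
print(max(score_b, key=score_b.get))  # 音乐领域,概率值约 4.8%
```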
如图14所示,音乐领域的对话引擎、设置领域的对话引擎和APP领域的对话引擎可以分别对文本信息a和文本信息b进行语义理解,得到语义理解结果。由于文本信息a归属于APP领域的概率值9.6%大于归属于音乐领域的概率值1.6%,也大于归属于设置领域的概率值0.6%;因此,终端可以输出APP领域的对话引擎对文本信息a进行语义理解得到的语义理解结果。由于文本信息b归属于音乐领域的概率值4.8%大于归属于APP领域的概率值2.1%,也大于归属于设置领域的概率值0.9%;因此,终端可以输出音乐领域的对话引擎对文本信息b进行语义理解得到的语义理解结果。
本申请实施例提供一种语音信息处理方法,不仅可以应用于终端与用户进行的单轮语音对话过程中,还可以应用于终端与用户进行的多轮语音对话的过程中。本申请实施例提供的方法,无论应用于终端与用户进行的单轮语音对话过程中,还是应用于终端与用户进行的多轮语音对话的过程中,都可以提高选择的事件领域的准确性,进而可以提高语义理解结果的准确性,从而可以提高终端执行事件与用户输入的语音信息指示终端执行的事件的符合度,可以提高用户体验。
可以理解的是,上述终端等为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本发明实施例能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明实施例的范围。
本申请实施例可以根据上述方法示例对上述终端等进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本发明实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
在采用对应各个功能划分各个功能模块的情况下,图15示出了上述实施例中所涉及的终端中的语音信息处理装置的一种可能的结构示意图,该语音信息处理装置1500包括:接收单元1501、转换单元1502、第一获取单元1503、第二获取单元1504、第三获取单元1505、计算单元1506和输出单元1507。
其中,上述接收单元1501用于支持终端执行方法实施例中的S401中“接收语音信息”的操作,和/或用于本文所描述的技术的其它过程。
上述转换单元1502用于支持终端执行方法实施例中的S401中“将语音信息转换为文本信息”的操作,和/或用于本文所描述的技术的其它过程。
上述第一获取单元1503用于支持终端执行方法实施例中的S402、S1001-S1003、S1101-S1103,和/或用于本文所描述的技术的其它过程。
上述第二获取单元1504用于支持终端执行方法实施例中的S403,和/或用于本文所描述的技术的其它过程。
上述第三获取单元1505用于支持终端执行方法实施例中的S404、S404a-S404c,和/或用于本文所描述的技术的其它过程。
上述计算单元1506用于支持终端执行方法实施例中的S405,和/或用于本文所描述的技术的其它过程。
上述输出单元1507用于支持终端执行方法实施例中的S406,和/或用于本文所描述的技术的其它过程。
进一步的,上述语音信息处理装置1500还可以包括:语义理解单元。该语义理解单元用于支持终端执行方法实施例中的S406',和/或用于本文所描述的技术的其它过程。
进一步的,上述语音信息处理装置1500还可以包括:存储单元。该存储单元用于保存方法实施例中所述的关键字模型和数据库模型等信息。
进一步的,上述语音信息处理装置1500还可以包括:执行单元。该执行单元用于支持终端执行方法实施例中的S407,和/或用于本文所描述的技术的其它过程。
其中,上述方法实施例涉及的各步骤的所有相关内容均可以援引到对应功能模块的功能描述,在此不再赘述。
当然,语音信息处理装置1500包括但不限于上述所列举的单元模块,例如,如图16所示,语音信息处理装置1500还可以包括第四获取单元1508,该第四获取单元用于支持终端执行方法实施例中的S402',和/或用于本文所描述的技术的其它过程。
并且,上述功能单元的具体所能够实现的功能也包括但不限于上述实例所述的方法步骤对应的功能,语音信息处理装置1500的其他单元的详细描述可以参考其所对应方法步骤的详细描述,本申请实施例这里不再赘述。
需要说明的是,上述语义理解单元可以对应于图2所示的对话引擎层202中的一个或多个对话引擎。上述转换单元1502可以对应于图2所示的领域识别模块2012。上述第一获取单元1503、第二获取单元1504、第三获取单元1505、计算单元1506的功能可以集成在图2所示的DS模块2014中实现。可以理解,上述语义理解单元、转换单元1502、第一获取单元1503、第二获取单元1504、第三获取单元1505、计算单元1506等的功能都可以集成在一个处理模块中实现,该处理模块可以是处理器或控制器,例如可以是中央处理器(Central Processing Unit,CPU),通用处理器,数字信号处理器(Digital Signal Processor,DSP),专用集成电路(Application-Specific Integrated Circuit,ASIC),现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、晶体管逻辑器件、硬件部件或者其任意组合。其可以实现或执行结合本申请公开内容所描述的各种示例性的逻辑方框,模块和电路。所述处理器也可以是实现计算功能的组合,例如包含一个或多个微处理器组合,DSP和微处理器的组合等等。上述接收单元1501和输出单元1507可以对应于图2所示的VSI 2011。该VSI 2011可以是终端的处理器的一个接口。上述存储单元可以是用于保存图2所示的算法层203中的算法和规则等的存储模块。该存储模块可以是存储器。
在采用集成的单元的情况下,图17示出了上述实施例中所涉及的终端的一种可能的结构示意图。该终端1700包括:处理模块1701和存储模块1702。存储模块1702用于保存终端的程序代码和数据(如算法和规则等)。处理模块1701用于执行存储模块1702保存的程序代码执行方法实施例所述的语音信息处理方法。进一步的,该终端1700还可以包括通信模块1703,该通信模块1703用于支持终端与其他网络实体的通信。通信模块1703可以是收发器、收发电路或通信接口等。存储模块1702可以是存储器。
当处理模块1701为处理器(如图1所示的处理器101),通信模块1703为RF收发电路(如图1所示的射频电路102),存储模块1702为存储器(如图1所示的存储器103)时,本发明实施例所提供的终端可以为图1所示的终端100。其中,上述通信模块1703不仅可以包括射频电路,还可以包括WiFi模块和蓝牙模块。射频电路、WiFi模块和蓝牙模块等通信模块可以统称为通信接口。本申请实施例的终端中可以包括一个或多个处理器和一个或多个存储器,上述一个或多个处理器、一个或多个存储器和通信接口可以通过总线耦合在一起。
本申请实施例提供一种电子设备,该电子设备包括上述实施例所述的用于执行上述实施例中的语音信息处理方法的语音信息处理装置1500。
本申请实施例还提供一种计算机存储介质,该计算机存储介质中存储有计算机程序代码,当电子设备的处理器执行该计算机程序代码时,该电子设备执行图3、图4、图6、图10和图11中任一附图中的相关方法步骤,以实现上述实施例中的语音信息处理方法。
本申请实施例还提供了一种计算机程序产品,当该计算机程序产品在电子设备上运行时,使得电子设备执行图3、图4、图6、图10和图11中任一附图中的相关方法步骤,以实现上述实施例中的语音信息处理方法。
其中,本发明实施例提供的语音信息处理装置1500、终端1700、计算机存储介质或者计算机程序产品均用于执行上文所提供的对应的方法,因此,其所能达到的有益效果可参考上文所提供的对应的方法中的有益效果,此处不再赘述。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:快闪存储器、移动硬盘、只读存储器、随机存取存储器、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何在本申请揭露的技术范围内的变化或替换,都应涵盖在本申请的保护范围之 内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (18)

  1. 一种语音信息处理方法,其特征在于,所述方法包括:
    终端接收语音信息,将所述语音信息转换为文本信息;所述终端中预设M个事件领域;
    所述终端获取所述文本信息归属于所述M个事件领域中的每个事件领域的领域概率,所述领域概率用于表征所述文本信息归属于一个事件领域的可能性;
    所述终端获取所述文本信息归属于N个事件领域中的每一个事件领域的先验概率,所述先验概率用于表征根据已进行的多次语义理解,确定所述文本信息归属于一个事件领域的概率,所述N个事件领域为所述M个事件领域中的N个事件领域,N小于或等于M;
    所述终端获取所述文本信息归属于所述N个事件领域中的每个事件领域的置信度,所述置信度用于表征所述文本信息归属于一个事件领域的确信程度;
    所述终端根据所述文本信息归属于所述N个事件领域中的每个事件领域的领域概率、先验概率和置信度,计算所述文本信息分别归属于所述N个事件领域的N个概率值;
    所述终端输出根据所述N个概率值中概率值最高的事件领域对文本信息进行语义理解的语义理解结果。
  2. 根据权利要求1所述的方法,其特征在于,当N小于M时,所述N个事件领域是所述预设M个事件领域中,领域概率按照由高至低的顺序排列在前N位的N个事件领域,N≥2。
  3. 根据权利要求1或2所述的方法,其特征在于,在所述终端获取所述文本信息归属于M个事件领域中的每个事件领域的领域概率之后,所述方法还包括:
    所述终端在所述N个事件领域,分别对所述文本信息进行语义理解,得到N个语义理解结果。
  4. 根据权利要求1-3中任一项所述的方法,其特征在于,所述M个事件领域中的每个事件领域对应于一个关键字模型,所述关键字模型中包括:对应事件领域的多个关键字;
    所述终端获取所述文本信息归属于所述N个事件领域中的每个事件领域的置信度,包括:
    所述终端对所述文本信息进行分词处理,并提取至少一个分词;
    所述终端获取所述至少一个分词对应的关键字在所述每个事件领域的关键字模型中的分布信息;
    所述终端根据所述分布信息,计算所述文本信息归属于所述N个事件领域中的每个事件领域的置信度。
  5. 根据权利要求1-4中任一项所述的方法,其特征在于,所述终端获取所述文本信息归属于M个事件领域中的每个事件领域的领域概率,包括:
    所述终端对所述文本信息进行分词处理,并提取至少一个分词;
    所述终端从所述每个事件领域对应的数据库模型中查找所述至少一个分词对应的特征,所述数据库模型中包括多个特征、每个特征的权重及每个特征对应的分词,所述权重用于指示所述权重对应的特征归属于所述数据库模型中对应的事件领域的概率;其中,每个事件领域对应一个数据库模型;
    所述终端根据从所述每个事件领域对应的数据库模型中查找到的特征的权重,计算所述文本信息归属于所述每个事件领域的领域概率。
  6. 根据权利要求1-4中任一项所述的方法,其特征在于,所述M个事件领域中的每个事件领域对应于一个关键字模型,所述关键字模型中包括:多个关键字和每个关键字指示文本信息归属于所述关键字模型对应的事件领域的概率;
    所述终端获取所述文本信息归属于M个事件领域中的每个事件领域的领域概率,包括:
    所述终端从所述文本信息中识别至少一个关键字;
    所述终端从所述每个事件领域对应的关键字模型中获取所述至少一个关键字分别指示的概率;
    所述终端根据所述至少一个关键字分别指示的概率,计算所述文本信息归属于所述每个事件领域的领域概率。
  7. 根据权利要求1-6中任一项所述的方法,其特征在于,还包括:
    所述终端输出所述语义理解结果之后,所述终端根据所述语义理解结果,执行所述语义理解结果对应的操作。
  8. 一种语音信息处理装置,其特征在于,所述装置包括:
    接收单元,用于接收语音信息;
    转换单元,用于将所述接收单元接收的所述语音信息转换为文本信息;所述终端中预设M个事件领域;
    第一获取单元,用于获取所述转换单元转换得到的所述文本信息归属于所述M个事件领域中的每个事件领域的领域概率,所述领域概率用于表征所述文本信息归属于一个事件领域的可能性;
    第二获取单元,用于获取所述转换单元转换得到的所述文本信息归属于N个事件领域中的每一个事件领域的先验概率,所述先验概率用于表征根据已进行的多次语义理解,确定所述文本信息归属于一个事件领域的概率,所述N个事件领域为所述M个事件领域中的N个事件领域,N小于或等于M;
    第三获取单元,用于获取所述转换单元转换得到的所述文本信息归属于所述N个事件领域中的每个事件领域的置信度,所述置信度用于表征所述文本信息归属于一个事件领域的确信程度;
    计算单元,用于根据所述第一获取单元获取的所述文本信息归属于所述N个事件领域中的每个事件领域的领域概率、所述第二获取单元获取的先验概率和所述第三获取单元获取的置信度,计算所述文本信息分别归属于所述N个事件领域的N个概率值;
    输出单元,用于输出根据所述计算单元计算得到的所述N个概率值中概率值最高的事件领域对文本信息进行语义理解的语义理解结果。
  9. 根据权利要求8所述的装置,其特征在于,当N小于M时,所述N个事件领域是所述预设M个事件领域中,领域概率按照由高至低的顺序排列在前N位的N个事件领域,N≥2。
  10. 根据权利要求8或9所述的装置,其特征在于,所述装置还包括:
    语义理解单元,用于在所述第一获取单元获取所述文本信息归属于M个事件领域中的每个事件领域的领域概率之后,在所述N个事件领域,分别对所述文本信息进行语义理解,得到N个语义理解结果。
  11. 根据权利要求8-10中任一项所述的装置,其特征在于,所述装置还包括:
    存储单元,用于保存所述M个事件领域中的每个事件领域对应的关键字模型,所述关键字模型中包括:对应事件领域的多个关键字;
    所述第三获取单元,具体用于:
    对所述文本信息进行分词处理,并提取至少一个分词;
    获取所述至少一个分词对应的关键字在所述存储单元保存的所述每个事件领域的关键字模型中的分布信息;
    根据所述分布信息,计算所述文本信息归属于所述N个事件领域中的每个事件领域的置信度。
  12. 根据权利要求8-11中任一项所述的装置,其特征在于,所述第一获取单元,具体用于:
    对所述文本信息进行分词处理,并提取至少一个分词;
    从所述每个事件领域对应的数据库模型中查找所述至少一个分词对应的特征,所述数据库模型中包括多个特征、每个特征的权重及每个特征对应的分词,所述权重用于指示所述权重对应的特征归属于所述数据库模型中对应的事件领域的概率;其中,每个事件领域对应一个数据库模型;
    根据从所述每个事件领域对应的数据库模型中查找到的特征的权重,计算所述文本信息归属于所述每个事件领域的领域概率。
  13. 根据权利要求8-11中任一项所述的装置,其特征在于,所述装置还包括:
    存储单元,用于保存所述M个事件领域中的每个事件领域对应的关键字模型,所述关键字模型中包括:多个关键字和每个关键字指示文本信息归属于所述关键字模型对应的事件领域的概率;
    所述第一获取单元,具体用于:
    从所述文本信息中识别至少一个关键字;
    从所述每个事件领域对应的关键字模型中获取所述至少一个关键字分别指示的概率;
    根据所述至少一个关键字分别指示的概率,计算所述文本信息归属于所述每个事件领域的领域概率。
  14. 根据权利要求8-13中任一项所述的装置,其特征在于,所述装置还包括:
    执行单元,用于在所述输出单元输出所述语义理解结果之后,根据所述语义理解结果,执行所述语义理解结果对应的操作。
  15. 一种终端,其特征在于,包括:一个或多个处理器和一个或多个存储器;所述一个或多个存储器中存储有一个或多个计算机程序,所述一个或多个计算机程序包括指令,当所述指令被所述一个或多个处理器执行时,使得所述终端执行如权利要求1-7任一所述的语音信息处理方法。
  16. 一种电子设备,其特征在于,所述电子设备包括执行如权利要求1-7中任一项所述的语音信息处理方法的装置。
  17. 一种包含指令的计算机程序产品,其特征在于,当所述计算机程序产品在电子设备上运行时,使得所述电子设备执行如权利要求1-7中任一项所述的语音信息处理方法。
  18. 一种计算机可读存储介质,包括指令,其特征在于,当所述指令在电子设备上运行时,使得所述电子设备执行如权利要求1-7中任一项所述的语音信息处理方法。