CN111373473B - Method for voice recognition of electronic equipment and electronic equipment - Google Patents


Publication number
CN111373473B
CN111373473B (application CN201880074893.0A)
Authority
CN
China
Prior art keywords
domain, text, sub-classifier, recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201880074893.0A
Other languages
Chinese (zh)
Other versions
CN111373473A (en)
Inventor
隋志成
李艳明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN111373473A publication Critical patent/CN111373473A/en
Application granted granted Critical
Publication of CN111373473B publication Critical patent/CN111373473B/en


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Abstract

A method for voice recognition by an electronic device, and an electronic device, relating to the field of terminal technologies, which can improve the flexibility of a terminal when recognizing voice instructions locally. The method comprises the following steps: converting a received voice instruction into text; performing domain recognition on the text through at least two sub-domain classifiers to obtain a domain recognition result, where the domain recognition result indicates the domain to which the text belongs; and then processing the text through a dialog engine corresponding to that domain to determine the function, corresponding to the text, that the electronic device is to execute. The method is applicable to voice recognition processes.

Description

Method for voice recognition of electronic equipment and electronic equipment
Technical Field
The present application relates to the field of terminal technologies, and in particular, to a method for performing voice recognition on an electronic device and an electronic device.
Background
With the development of terminal technology, and in particular the popularization of voice recognition technology, a user can now instruct a terminal to execute a corresponding function by inputting a voice command to it. Taking a mobile phone as an example, a user can input a segment of voice through the phone, and the phone sends that voice to the cloud, where it is converted into text and the text is processed to obtain a processing result. The cloud then returns the processing result to the phone, and the phone executes the function matching that result.
It can be seen that this implementation depends mainly on the processing capability of the cloud. That is, when the terminal cannot exchange data with the cloud, it cannot execute the corresponding function from an input voice command. To address this problem, a function for recognizing and processing voice commands has been added to some terminals: after the terminal converts the voice into text through voice recognition, it processes the text by template matching to determine the function the terminal needs to invoke, i.e., the processing result. Template matching means that the terminal compares the obtained text against existing templates and finds a template that matches the text exactly. The terminal then determines, from the stored correspondence between templates and functions, which function corresponds to the matched template, and executes that function.
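As a rough illustration of the exact template matching described above (the slot structures and function names here are assumptions for illustration, not taken from the patent), a minimal sketch:

```python
# Hypothetical template table: an exact slot structure maps to one function.
TEMPLATES = {
    ("time", "place", "action"): "create_reminder",
    ("action", "object"): "open_application",
}

def match_template(slot_structure):
    # Only an exact structural match yields a function; any reordering fails.
    return TEMPLATES.get(tuple(slot_structure))
```

An instruction parsed as time + place + action matches, while the same words reordered as place + time + action find no template at all, which is exactly the rigidity discussed next.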
These implementations, however, require the obtained text to match a template exactly. For example, if a template defines the text structure as "time + place + action", the terminal can determine that the text matches the template only when the text follows exactly that structure. When the text instead has the structure "place + time + action", the template cannot be fully matched, the terminal finds no matching template and cannot determine the function matching the text, and the user therefore cannot invoke the terminal's function by inputting a voice instruction.
Disclosure of Invention
The embodiment of the application provides a voice recognition method for electronic equipment and the electronic equipment, so as to improve the flexibility of a terminal in carrying out voice instruction recognition locally.
In order to achieve the above purpose, the embodiment of the application adopts the following technical scheme:
In a first aspect, an embodiment of the present application provides a method for performing speech recognition by an electronic device. The method comprises: converting a received voice instruction into text; performing domain recognition on the text through at least two sub-domain classifiers to obtain a domain recognition result, which indicates the domain to which the text belongs; and processing the text through the dialog engine corresponding to that domain to determine the function, corresponding to the text, that the electronic device is to execute. Implementing voice recognition this way distinguishes texts effectively by domain, so that the domain-specific text recognition process can then be completed in a more targeted manner to determine the function the electronic device needs to execute, which improves recognition accuracy. Moreover, the whole process can run locally on the electronic device. That is, even when the device cannot access a network, voice commands can be recognized without relying on cloud processing capability, which improves the flexibility of voice recognition.
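The steps of the first aspect can be sketched as follows. This is a hypothetical, minimal illustration; the classifier and engine stubs stand in for trained components and are assumptions, not the patented implementation:

```python
def recognize(text, classifiers, engines):
    """classifiers: at least two sub-domain classifiers, each returning a
    domain name or None; engines: maps each domain to its dialog engine,
    which determines the function the device should execute."""
    for classify in classifiers:
        domain = classify(text)
        if domain is not None:
            return engines[domain](text)
    return None

# Illustrative stubs standing in for trained classifiers and dialog engines.
music_clf = lambda t: "music" if "play" in t else None
alarm_clf = lambda t: "alarm" if "wake" in t else None
engines = {"music": lambda t: "start_player", "alarm": lambda t: "set_alarm"}
```

Because everything here is local state, the sketch also shows why no network round-trip to the cloud is needed.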
In one exemplary implementation, after the voice instruction is converted to text, the text may be matched against pre-stored texts. When the text matches a pre-stored text successfully, the domain corresponding to that pre-stored text is taken as the text's domain recognition result. In this implementation, pre-matching reduces the resources that subsequent domain recognition by the sub-domain classifiers would consume. The matching process serves as a preliminary screen: if the converted text follows a common sentence pattern, the domain to which it belongs can be identified accurately from the existing correspondence between pre-stored texts and domains, without involving any sub-domain classifier, thereby completing the voice-instruction-based domain recognition process.
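A minimal sketch of this pre-matching step, assuming a simple lookup table (the sentences and domain names are invented for illustration):

```python
# Hypothetical pre-stored text table mapping common sentence patterns to domains.
PRESTORED = {
    "play music": "media",
    "call mom": "telephony",
}

def prematch_domain(text):
    """Return the stored domain for a common sentence pattern; None means the
    match failed and the sub-domain classifiers must take over."""
    return PRESTORED.get(text.strip().lower())
```

A dictionary lookup is constant-time, which is why this screen is cheap compared with running every sub-domain classifier.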
In an exemplary implementation, performing parallel domain recognition on the text through at least two sub-domain classifiers to obtain the domain recognition result may specifically be implemented as: when the text fails to match any pre-stored text, performing parallel domain recognition on the text through the at least two sub-domain classifiers. Since the text may not follow a common sentence pattern, after the preliminary screening the sub-domain classifiers perform domain recognition on it. Note that this may be implemented as parallel recognition by multiple sub-domain classifiers, i.e., at least two sub-domain classifiers recognizing the text's domain at the same time, which saves the time occupied by domain recognition.
In one exemplary implementation, the electronic device includes N sub-domain classifier groups, where each group has a different priority and N is a positive integer greater than or equal to 2. Performing parallel domain recognition on the text through at least two sub-domain classifiers to obtain the domain recognition result may then be implemented as follows. Domain recognition is first performed on the text through the sub-domain classifiers in the highest-priority group. If a sub-domain classifier in that group identifies the domain to which the text belongs, that domain is taken as the domain recognition result. If not, domain recognition is performed through the sub-domain classifiers in the next-priority group, and so on, until either the text's domain is identified and taken as the domain recognition result, or the text has been processed by all sub-domain classifiers in the N groups. At least one of the N sub-domain classifier groups contains at least two sub-domain classifiers.
In this implementation, the sub-domain classifiers in each priority group recognize the text in a fixed order. Once a domain recognition result is obtained from the sub-domain classifiers in some priority group, it can be returned directly without passing the text on to the next priority group for recognition, so that an accurate domain recognition result is obtained while fewer sub-domain classifiers are used.
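The priority-group cascade above can be sketched as follows; the classifiers within one group run in parallel, and the cascade stops at the first group that yields any domain. The stub classifiers and domain names are assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def cascade_classify(text, groups):
    """groups: lists of sub-domain classifiers ordered from highest to lowest
    priority. Classifiers within one group run in parallel; the first group
    producing any domain short-circuits the rest of the cascade."""
    with ThreadPoolExecutor() as pool:
        for group in groups:
            hits = [d for d in pool.map(lambda clf: clf(text), group) if d]
            if hits:
                return hits  # candidates for the aggregate decision
    return []

# Illustrative stubs: one high-priority group and one low-priority group.
high = [lambda t: "telephony" if "call" in t else None]
low = [lambda t: "music" if "play" in t else None,
       lambda t: "alarm" if "wake" in t else None]
```

Short-circuiting after the first productive group is what keeps the lower-accuracy classifiers idle for most inputs.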
In one exemplary implementation, at least two sub-domain classifiers in at least one of the N sub-domain classifier groups perform domain recognition on the text in parallel. That is, not every priority group needs to contain multiple sub-domain classifiers; it suffices that at least one priority group does. The more sub-domain classifiers that perform domain recognition on the text in parallel, the more accurate the resulting domain recognition result.
In one exemplary implementation, among the N sub-domain classifier groups, the domain recognition accuracy of the sub-domain classifiers in a lower-priority group is lower than that of the sub-domain classifiers in a higher-priority group. Because the higher-priority classifiers are more accurate, this progressive recognition process effectively reduces the load on the classifiers with lower accuracy and thus improves the accuracy of the overall domain recognition process.
In one exemplary implementation, at least one of the N sub-domain classifier groups includes a first sub-domain classifier and a second sub-domain classifier. When the first sub-domain classifier recognizes the text and obtains a first domain recognition result, and the second sub-domain classifier obtains a second domain recognition result, either one of the two results is determined to be the domain recognition result, or both are determined to be the domain recognition result. In other words, when multiple sub-domain classifiers in the same priority group all obtain results, the final domain recognition result may be selected based on a preset rule or a configured aggregate decision mode, e.g., selecting one of the results, or several, or all of them; the rule or decision mode is not limited.
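One possible aggregate decision over candidates from the same priority group, sketched under the assumption of two modes (majority vote for a single result, or keeping all results); the patent leaves the actual rule open:

```python
from collections import Counter

def aggregate(candidates, mode="one"):
    """candidates: domains produced by classifiers in the same priority group.
    mode="one" keeps the most common single result (an assumed tie-break);
    mode="all" keeps every distinct candidate, order preserved."""
    if not candidates:
        return None
    if mode == "one":
        return Counter(candidates).most_common(1)[0][0]
    if mode == "all":
        return list(dict.fromkeys(candidates))
```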
In one exemplary implementation, each of the at least two sub-domain classifiers performs domain recognition on the text as follows. Named entity recognition (NER) is performed on the text, and common features in the recognized content are determined. The common features are then replaced according to preset rules, which specify the replacement content for each category of common feature. Features are extracted from the text after replacement, the weight of each feature is determined, and a value for the text is calculated from those weights. When the text's value exceeds a threshold, the text is determined to belong to the domain corresponding to that sub-domain classifier. Note that replacing common features reduces the computing resources consumed in calculating the text's value, and effectively limits the influence of those features on the domain recognition process, thereby improving the accuracy of domain recognition for the text.
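The replace-then-score procedure can be sketched as below. The replacement rules, feature weights, and threshold are all invented placeholders standing in for a trained corpus feature library:

```python
import re

# Assumed replacement rules: each category of common feature (time, place)
# collapses to one placeholder token, so the feature itself cannot sway the score.
RULES = [
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "<TIME>"),
    (re.compile(r"\bat the \w+\b"), "<PLACE>"),
]
# Illustrative corpus feature library: feature -> weight.
WEIGHTS = {"play": 2.0, "music": 1.5, "<TIME>": 0.1, "<PLACE>": 0.1}
THRESHOLD = 3.0

def belongs_to_domain(text):
    # Replace common features, then sum the weights of the remaining features.
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    score = sum(WEIGHTS.get(tok, 0.0) for tok in text.split())
    return score > THRESHOLD
```

Because "at the gym" and "at the office" both collapse to `<PLACE>` with a tiny weight, the score is dominated by domain-bearing words, which is the point of the replacement step.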
In one exemplary implementation, the at least two sub-domain classifiers may be trained in advance, before parallel domain recognition is performed through them. Each sub-domain classifier is trained as follows:
positive and negative samples of the sub-domain classifier are generated. It should be noted that each sub-domain classifier may have its own independent positive and negative samples, where the positive and negative samples include a positive training sample set and a negative training sample set. The samples in the positive training sample set are samples belonging to the corresponding field of the sub-field classifier, and the samples in the negative training sample set are samples not belonging to the corresponding field of the sub-field classifier.
NER and rule extraction are performed on the positive and negative samples, and common-feature replacement is performed on the NER-processed samples. A common feature is content that affects the text's value when that value is calculated, but whose presence does not affect the domain to which the text belongs. In one implementation of the embodiment of the present application, common features include, but are not limited to, terms such as time and place, and may be preset. In the embodiment of the present application, common features may be replaced by symbols or the like, which is not limited here. Rules include, but are not limited to, sentence patterns such as "search for pictures … …". Note that performing NER on the positive and negative samples may be a precondition for rule extraction and common-feature replacement: NER identifies the places, times, sentence patterns, and so on in the samples; the sentence patterns are then taken as rules, the times and places as common features, and the replacement between common features and symbols is completed.
Stop-word removal. That is, when training the sub-domain classifier, to prevent modal particles such as "o" and "ya" and symbols such as ";" in the positive and negative samples from interfering with the recognition process, these stop words must be identified and ignored during domain recognition.
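This step amounts to a simple filter; the stop-word list below is an assumption transliterating the patent's examples:

```python
# Assumed stop-word list: modal particles and punctuation carrying no
# domain information.
STOP_WORDS = {"o", "ya", ";", ","}

def remove_stop_words(tokens):
    # Drop stop words so they cannot interfere with domain recognition.
    return [t for t in tokens if t not in STOP_WORDS]
```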
Features are extracted to generate a corpus feature library, and the value corresponding to the text is calculated from the weights. The corpus feature library stores the correspondence between features and weights.
The sub-domain classifier is trained, the impact of erroneous domain recognition results is evaluated, and the positive and negative samples are then revised accordingly.
This training process dynamically adjusts the distribution of the positive and negative samples, thereby improving the recognition accuracy of the sub-domain classifier.
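The iterative train-evaluate-adjust loop described above can be sketched as follows. This is a deliberately naive stand-in, not the patented algorithm: weights are word counts (positive occurrences minus negative occurrences), and adjustment simply duplicates misrecognized negative samples to shift the distribution:

```python
def train_sub_domain_classifier(positive, negative, rounds=2):
    """positive/negative: lists of sample sentences for one sub-domain
    classifier. Returns the final feature-weight table."""
    for _ in range(rounds):
        weights = {}
        for sample in positive:
            for tok in sample.split():
                weights[tok] = weights.get(tok, 0) + 1
        for sample in negative:
            for tok in sample.split():
                weights[tok] = weights.get(tok, 0) - 1
        # Evaluate: a negative sample is misrecognized if its total weight
        # is positive (it would be accepted into the domain).
        errors = [s for s in negative
                  if sum(weights.get(t, 0) for t in s.split()) > 0]
        # Adjust the sample distribution by emphasizing the misrecognized negatives.
        negative = negative + errors
    return weights
```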
In a second aspect, an embodiment of the present application provides an electronic device. The electronic device can realize the functions realized in the method embodiment, and the functions can be realized by hardware or can be realized by executing corresponding software by hardware. The hardware or software comprises one or more modules corresponding to the functions.
In a third aspect, an embodiment of the present application provides an electronic device. The electronic device includes a memory and one or more processors. The memory is for storing computer program code comprising computer instructions. The one or more processors, when reading and executing the computer instructions, cause the electronic device to implement the method of the first aspect and any of its exemplary implementations.
In a fourth aspect, embodiments of the present application provide a readable storage medium comprising instructions. The instructions, when executed on an electronic device, cause the electronic device to perform the method of any of the above first aspect and its various exemplary implementations.
In a fifth aspect, embodiments of the present application provide a computer program product comprising software code for performing the method of the first aspect and any of the various exemplary implementations thereof.
Drawings
Fig. 1 is a schematic structural diagram of a terminal according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an exemplary method according to an embodiment of the present application;
FIG. 3 is a flowchart of an exemplary method for processing voice commands by a mobile phone according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an exemplary domain-aware multi-classification system according to an embodiment of the application;
FIG. 5 is a schematic diagram of a text field recognition implementation process using the system shown in FIG. 4 according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for training a sub-domain classifier under the condition that the text is known to belong to the domain, according to an embodiment of the present application;
FIG. 7 is a flowchart of a training method for adjusting positive and negative samples of a sub-domain classifier according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application may be applied to an electronic device, which may be a terminal such as a notebook computer, a smartphone, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a vehicle-mounted device, or a smart wearable device. The terminal may be provided with at least a display, an input device, and a processor. As an example, as shown in fig. 1, the terminal 100 includes a processor 101, a memory 102, a camera 103, an RF circuit 104, an audio circuit 105, a speaker 106, a microphone 107, an input device 108, other input devices 109, a display 110, a touch panel 111, a display panel 112, an output device 113, and a power supply 114. The display 110 comprises at least the touch panel 111 as an input device and the display panel 112 as an output device. It should be noted that the terminal structure shown in fig. 1 does not limit the terminal, which may include more or fewer components than shown, combine or split some components, or arrange components differently; this is not limited here.
The following describes the respective constituent elements of the terminal 100 in detail with reference to fig. 1:
the Radio Frequency (RF) circuit 104 may be used to receive and send information or receive and send signals during a call, for example, if the terminal 100 is a mobile phone, the terminal 100 may receive downlink information sent by a base station through the RF circuit 104 and then send the downlink information to the processor 101 for processing; in addition, data relating to uplink is transmitted to the base station. Typically, RF circuitry includes, but is not limited to, antennas, at least one amplifier, transceivers, couplers, low noise amplifiers (Low Noise Amplifier, LNAs), diplexers, and the like. In addition, the RF circuitry 104 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol including, but not limited to, global system for mobile communications (Global System for Mobile communication, GSM), general packet radio service (General Packet Radio Service, GPRS), code division multiple access (Code Division Multiple Access, CDMA), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), long term evolution (Long Term Evolution, LTE), email, short message service (Short Messaging Service, SMS), and the like.
The memory 102 may be used to store software programs and modules, and the processor 101 executes various functional applications and data processing of the terminal 100 by executing the software programs and modules stored in the memory 102. The memory 102 may mainly include a storage program area that may store an operating system, application programs required for at least one function (e.g., a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data (e.g., audio data, video data, etc.) created according to the use of the terminal 100, and the like. In addition, memory 102 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
Other input devices 109 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the terminal 100. In particular, other input devices 109 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, a light mouse (a light mouse is a touch-sensitive surface that does not display visual output, or is an extension of a touch-sensitive surface formed by a touch screen), and so forth. Other input devices 109 may also include sensors built into the terminal 100, such as a gravity sensor, an acceleration sensor, etc., and the terminal 100 may also take as input data parameters detected by the sensors.
The display 110 may be used to display information input by the user or provided to the user, along with the various menus of the terminal 100, and may also accept user input. The display panel 112 may be configured as a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel 111, also referred to as a touch screen or touch-sensitive screen, may collect touch or non-touch operations on or near it (e.g., operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 111, which may also include somatosensory operations; operation types include single-point and multi-point control operations, etc.), and drive the corresponding connected devices according to a preset program. The touch panel 111 may further include a touch detection device and a touch controller. The touch detection device detects the user's touch position and touch gesture, detects the signals produced by the touch operation, and transmits them to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into information the processor 101 can process, transmits it to the processor 101, and also receives and executes commands sent by the processor 101. The touch panel 111 may be implemented as resistive, capacitive, infrared, surface acoustic wave, or any type of technology developed in the future.
In general, the touch panel 111 may overlay the display panel 112, and a user may operate on or near the touch panel 111 overlaid on the display panel 112 according to content displayed on the display panel 112 (including but not limited to a soft keyboard, a virtual mouse, virtual keys, icons, etc.), and after the touch panel 111 detects the operation thereon or thereabout, the operation is transmitted to the processor 101 to determine user input, and then the processor 101 provides a corresponding visual output on the display panel 112 according to the user input. Although in fig. 1, the touch panel 111 and the display panel 112 are two independent components to implement the input and output functions of the terminal 100, in some embodiments, the touch panel 111 and the display panel 112 may be integrated to implement the input and output functions of the terminal 100.
The audio circuit 105, speaker 106, and microphone 107 may provide an audio interface between the user and the terminal 100. The audio circuit 105 may convert received audio data into a signal and transmit it to the speaker 106, which converts it into a sound signal for output; conversely, the microphone 107 may convert collected sound signals into signals that the audio circuit 105 receives and converts into audio data, which is output to the RF circuit 104 for transmission to a device such as another terminal, or output to the memory 102 for further processing by the processor 101 in conjunction with the content stored there. In addition, the camera 103 may acquire image frames in real time and transmit them to the processor 101 for processing, storing the results in the memory 102 and/or presenting them to the user through the display panel 112.
The processor 101 is a control center of the terminal 100, connects various parts of the entire terminal 100 using various interfaces and lines, and performs various functions of the terminal 100 and processes data by running or executing software programs and/or modules stored in the memory 102 and calling data stored in the memory 102, thereby performing overall monitoring of the terminal 100. It should be noted that the processor 101 may include one or more processing units; the processor 101 may also integrate an application processor that primarily processes operating systems, user Interfaces (UIs), applications, etc., and a modem processor that primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 101.
The terminal 100 may also include a power supply 114 (e.g., a battery) for powering the various components, wherein in embodiments of the application the power supply 114 may be logically coupled to the processor 101 via a power management system such that charge, discharge, and power consumption functions are managed by the power management system.
In addition, there are components not shown in fig. 1, for example, the terminal 100 may further include a bluetooth module, etc., which will not be described herein.
The following describes an embodiment of the present application by taking the terminal 100 as a mobile phone.
At present, after the mobile phone sends a received voice instruction to the cloud, the cloud converts the voice instruction into text through voice recognition, then processes the text to determine the function, corresponding to the text, that the phone needs to execute, i.e., the processing result. The cloud may process the text by matching it, template by template, against the configured contents until a processing result is obtained; or the cloud may extract keywords and key phrases from the text and derive the processing result from them. The cloud then returns the processing result to the phone, and the phone implements the corresponding function.
Therefore, both the conversion of the voice command into text and the subsequent processing of that text can occur in the cloud; the phone only needs to send the received voice command to the cloud, receive the processing result after the cloud finishes processing, and execute the corresponding function.
In this implementation, data is transmitted between the phone and the cloud over a network, so when the phone cannot connect to a network, it cannot be guaranteed to accurately and effectively execute the function corresponding to the voice instruction.
In addition, regardless of whether the phone needs a network connection to process voice instructions, and whether text processing by template matching is completed in the cloud or locally on the phone, templates are mostly produced by manual annotation, which consumes considerable manpower and material resources. Moreover, a template is fixed once generated: when the structure of the voice instruction does not exactly match the template's structure, the failure rate of processing increases. Template-based text processing is thus inflexible, and because the text must be matched against many templates, it is also time-consuming. Similar problems arise when text is processed by extracting keywords or key phrases.
During template matching, when the text content involves multiple domains, ambiguity easily arises, i.e., the recognition rate is low. For example, the text "how to say good in English" involves both translation and language settings, yet the cloud recognizes it as "set language". The text "translate turn on eye-protection mode" involves both translation and mode activation, yet the cloud recognizes it as "turn on eye-protection mode". The text "help me record the geographical position of a restaurant" involves both location information and recording, yet the cloud recognizes it as "Global Positioning System (GPS)". The text "send a microblog post in a large font" involves both posting and font adjustment, yet the cloud recognizes it as "font". The text "remind me to turn on flight mode in the afternoon" involves both a time and mode activation, yet the cloud recognizes it as "turn on flight mode". Therefore, when a text involves multiple domains, the cloud or the mobile phone can hardly determine the function corresponding to the text accurately.
Here, a domain refers to the type of a text. Types may be divided according to the language environment in which the text occurs. In the embodiments of this application, the domain serves as the text type: in text processing, one domain corresponds to one type of task, i.e., texts belonging to the same domain are handled as the same type of task by the same dialog engine. The dialog engine's processing of the text may include parsing the text through natural language processing (NLP) techniques and outputting a processing result. The processing result may include code for the function the mobile phone is to implement, so that the phone invokes the corresponding function. In one specific implementation, the dialog engine parses the text within its domain, determines the function to be executed that corresponds to the text, and may also generate instruction code corresponding to that function; the instruction code is code that the machine can recognize in order to perform the corresponding function.
The code may be binary code or high-level language code, such as &lt;play&gt;&lt;wangfei&gt;&lt;song&gt; (the instruction code generated for the user's voice instruction "I want to listen to Wang Fei's music"); this application is not limited in this respect.
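To make the instruction-code idea concrete, the following is a minimal sketch of how a dialog engine might turn a parsed voice instruction into machine-readable code in the angle-bracket form shown above. The slot names and keyword tables are hypothetical illustrations, not the patent's actual API.

```python
# A minimal sketch: map a recognized text to instruction code via
# simple slot filling. Keyword tables are hypothetical examples.

def generate_instruction_code(text: str) -> str:
    """Map a recognized text to instruction code like <play><wangfei><song>."""
    # Hypothetical keyword-to-slot tables for a music domain.
    actions = {"listen": "play", "play": "play"}
    artists = {"wang fei": "wangfei"}
    objects = {"music": "song", "song": "song"}

    lowered = text.lower()
    action = next((v for k, v in actions.items() if k in lowered), None)
    artist = next((v for k, v in artists.items() if k in lowered), None)
    obj = next((v for k, v in objects.items() if k in lowered), None)

    slots = [s for s in (action, artist, obj) if s is not None]
    # Emit the instruction code in the angle-bracket form from the text.
    return "".join(f"<{s}>" for s in slots)

print(generate_instruction_code("I want to listen to Wang Fei's music"))
# → <play><wangfei><song>
```

In practice the dialog engine would derive such slots from full NLP parsing rather than substring lookup; the sketch only illustrates the shape of the output code.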
To solve the above problems, an embodiment of this application provides a speech recognition method. Fig. 2 is a schematic flowchart of an exemplary method according to an embodiment of this application. After the mobile phone receives a voice instruction through a user voice entry, it converts the instruction into text using speech recognition technology. The terminal then performs domain recognition on the text, the domain recognition result is passed to the dialog engine corresponding to that result for processing, and finally the processing result obtained is fed back to the mobile phone.
It should be noted that when the user inputs a voice instruction through an entry provided by a third-party application or a system application, the dialog engine may feed the processing result back to the application providing the entry, so that the application can implement functions such as interface switching. When the user inputs the voice instruction through an entry provided by a system-level display interface of the mobile phone — that is, at the moment of input the phone is showing the home screen or a system-level interface such as the settings interface, rather than the running interface of an application — the dialog engine may feed the processing result back to the phone's system, so that the system can implement functions such as launching an application or adjusting the font size of the display interface. For example, suppose the interface presented to the user is the phone's home screen and the user launches a game application by inputting the voice instruction "open the game application". After completing speech recognition and the subsequent processing of the instruction, the phone feeds the processing result back to its system, and the system starts the game application.
System applications include, but are not limited to, applications preloaded at the factory that can receive voice instructions. Third-party applications include, but are not limited to, applications with a voice-instruction function that the user downloads and installs from a platform such as an application store, applications that are invoked by other applications on the phone, and so on. In the embodiments of this application, a system-level display interface refers to any interface on the phone other than an application's running interface, such as the home screen or the settings interface; an application's running interface includes, but is not limited to, any interface presented to the user while or after the application starts, for example the application's loading interface or its settings interface.
In the above domain recognition process, the domains involved include, but are not limited to, setting, do-not-disturb, gallery, translation, stock, weather, calculator, and encyclopedia (baike).
In the embodiments of this application, the text produced by speech recognition may be assigned to a predetermined category by recognizing keywords in the text, by template matching, or by processing with sub-domain classifiers, where the predetermined categories include, but are not limited to, the domains exemplified above. Keyword recognition and template matching can be implemented with reference to existing domain recognition techniques for text and are not repeated here; the sub-domain classifiers may be provided in the domain-recognition multi-classification system shown in fig. 2, and their function is described later.
The domain-recognition multi-classification system shown in fig. 2 performs domain recognition on the text whose conversion the mobile phone has completed and outputs a corresponding domain recognition result. The phone then passes the text, according to that result, to the corresponding dialog engine, obtains the processing result, and invokes the corresponding function as the processing result indicates.
The user voice entry may be a general entry such as a voice assistant, or a local entry of a system application or third-party application on the phone. For example, taking a system application, the user speaks inside the gallery so that the phone performs a picture search within the gallery.
Fig. 3 is a flowchart of an exemplary method by which a mobile phone processes a voice instruction according to an embodiment of this application. Take opening the phone's Wi-Fi by voice as an example: the user inputs the voice instruction "turn on Wi-Fi", and after speech recognition the phone obtains a text whose content is "turn on Wi-Fi". The phone performs domain recognition on the text, confirms that the text belongs to the setting domain, and sends the text to the dialog engine corresponding to the setting domain — that is, the phone sends the domain recognition result to the local multi-domain semantic-understanding dialog engine corresponding to the setting domain for processing — and then executes the corresponding function according to the processing result. In the embodiments of this application, after the phone executes the function, the dialog engine may also prompt the user with the execution result, for example by voice playback or a pop-up dialog box, indicating that the phone has finished executing the function requested by the voice instruction. In the example shown in fig. 3, the phone may play "Wi-Fi is on" aloud, or pop up text such as "Wi-Fi is on" to prompt the user.
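The routing step in fig. 3 — text goes to the dialog engine registered for its recognized domain, which returns a processing result plus a user prompt — can be sketched as follows. The domain name, engine behavior, and result fields are illustrative assumptions, not the patent's implementation.

```python
# A minimal sketch of routing recognized text to a per-domain dialog
# engine, as in the Wi-Fi example of fig. 3. Names are hypothetical.

def setting_engine(text: str) -> dict:
    """Hypothetical dialog engine for the 'setting' domain."""
    if "wi-fi" in text.lower():
        return {"function": "enable_wifi", "prompt": "Wi-Fi is on"}
    return {"function": None, "prompt": "unrecognized setting request"}

DIALOG_ENGINES = {"setting": setting_engine}   # domain -> dialog engine

def handle_text(text: str, domain: str) -> dict:
    """Route text to the dialog engine for its recognized domain."""
    engine = DIALOG_ENGINES[domain]
    return engine(text)

result = handle_text("turn on Wi-Fi", "setting")
print(result["function"], "->", result["prompt"])
```

The returned prompt corresponds to the voice playback or pop-up mentioned above; the `function` field stands in for the instruction code the phone would execute.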
The domain recognition in the embodiments of this application differs from the prior-art process performed locally on the phone. In the prior art, the processing of voice instructions relies mainly on template matching, so when the structure of the text differs from that of the templates, the phone cannot obtain an accurate processing result. The embodiments of this application introduce domain recognition and dialog engines: the domain recognition process considers not only template matching and keyword extraction but also employs parallel sub-domain classifiers working together, screening out from the many domains the one or more domains corresponding to the text, which is then processed by the dialog engine of each selected domain. In this way, even when the text structure cannot fully match any template, the phone can still analyze and process the text further. Notably, for a text involving multiple domains, the phone can process it from the perspective of several domains instead of simply pushing it to the dialog engine of a single domain. Therefore, the speech recognition procedure provided by the embodiments of this application can effectively distinguish the domains of texts and then complete the domain-based text recognition process in a more targeted way, determining the functions the electronic device needs to execute and improving recognition accuracy. Moreover, the procedure can be executed locally on the electronic device.
That is, even when the electronic device cannot access the network, voice instructions can be recognized without relying on cloud processing capability, which improves the flexibility of speech recognition.
It should be noted that the example shown in fig. 3 is an interaction of the phone's dialog system: the user inputs a voice instruction, the phone processes it and executes the corresponding function, and the result of executing the function is fed back to the user by voice playback or on-screen display. The user's voice instruction and the voice or display output of the phone together constitute the dialog system. In other words, the voice playback or displayed result is an exemplary way for the phone to respond to the user's voice instruction while or after executing the corresponding function.
Fig. 4 shows an exemplary domain-recognition multi-classification system according to an embodiment of this application. Its purpose is to complete domain recognition starting from a text. For the domain recognition process, the system can be divided into three layers: a control layer, a classifier layer, and an algorithm layer.
The function and role of each layer of the system are described below.
In one illustrative example, the control layer includes the following: fast full-precision text matching, domain scheduling, classification decision, and a data loader.
With fast full-precision text matching, the control layer can assign the text to a domain directly according to common phrases, sentence patterns, and the like — such as common, unambiguous fixed expressions — without further processing by the classifier layer. In the embodiments of this application, the templates for fast full-precision matching may be preset; the specific setting may follow existing manual templates, for example the templates involved in the template-matching approach described in the background, and is not repeated here.
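Fast full-precision matching can be sketched as an exact lookup against a preset table of fixed expressions; a hit returns the domain immediately and the classifier layer is skipped. The phrase table below is a hypothetical illustration.

```python
# A minimal sketch of fast full-precision matching at the control
# layer: an exact match against preset fixed phrases decides the
# domain directly. The phrase table is hypothetical.

FIXED_PHRASES = {
    "turn on wi-fi": "setting",
    "what's the weather today": "weather",
    "open do-not-disturb": "no_disturb",
}

def fast_full_precision_match(text: str):
    """Return the domain for an exact phrase match, or None to fall
    through to the classifier layer."""
    return FIXED_PHRASES.get(text.strip().lower())

print(fast_full_precision_match("Turn on Wi-Fi"))       # exact phrase
print(fast_full_precision_match("turn wi-fi back on"))  # no exact match
```

Returning `None` here corresponds to the case where the control layer cannot determine the domain and passes the text on for further processing.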
The domain-scheduling function includes scheduling the sub-domain classifiers in each priority of the classifier layer. For example, after the fast full-precision matching of the text, domain scheduling may schedule the sub-domain classifiers in priority 1 to process the text, and then continue to schedule the sub-domain classifiers in priority 2, until the domain of the text is determined or all sub-domain classifiers in the classifier layer have processed the text without determining its domain. Furthermore, for a single sub-domain classifier, domain scheduling may also be used to invoke the algorithms, rules, patterns, and so on that the classifier involves.
Algorithms, rules, patterns, and the like are used during a sub-domain classifier's processing of the text. In the embodiments of this application, when the text matches or satisfies a rule, the classifier layer returns a domain recognition result; when the text matches or satisfies a pattern, it can be concluded that there is a greater probability that the text belongs to the domain corresponding to that pattern. In one embodiment, the correspondence between a rule and the text plays a decisive role in determining the domain of the text, while the correspondence between a pattern and the text increases the accuracy of that determination; a specific implementation is described in the examples mentioned later and is not repeated here.
That is, domain scheduling links the control layer and the classifier layer. After the fast full-precision matching of the text, it schedules the sub-domain classifiers in each priority in order from the highest priority to the lowest, and while scheduling a sub-domain classifier to process the text it invokes the corresponding algorithms, rules, patterns, and so on as the classifier requires.
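The traversal just described can be sketched as a loop over priority groups from high to low, stopping as soon as any group yields a valid domain recognition result. The classifier callables and domain names below are assumed for illustration.

```python
# A minimal sketch of domain scheduling: priority groups are tried
# from high to low; traversal stops at the first group that yields
# any valid domain recognition result.

def schedule_domain_recognition(text, priority_groups):
    """priority_groups: list (highest priority first) of lists of
    sub-domain classifiers; each classifier is a callable returning
    a domain name or None. Returns the set of domains from the first
    group that produced any, or an empty set if no group did."""
    for group in priority_groups:
        results = {clf(text) for clf in group}
        results.discard(None)
        if results:                  # valid result: stop traversal
            return results
    return set()

# Hypothetical sub-domain classifiers for illustration.
setting_clf = lambda t: "setting" if "wi-fi" in t else None
weather_clf = lambda t: "weather" if "weather" in t else None

groups = [[setting_clf], [weather_clf]]
print(schedule_domain_recognition("turn on wi-fi", groups))
print(schedule_domain_recognition("weather tomorrow", groups))
```

An empty return set corresponds to the case, described below, where every priority has processed the text without determining its domain.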
The main purpose of the classification decision — i.e., the summarizing decision — is, when the control layer has not determined the domain of the text after fast full-precision matching, to determine the domain of the text, or to determine that no domain exists for it, by combining the processing results obtained in each priority of the classifier layer.
For example, if, after the sub-domain classifiers in priority 1 process the text, the text is determined to belong to both domain 1 and domain 2, the classification decision determines how the text's domain is settled when the recognition results obtained from all sub-domain classifiers of the same priority include several domains. In the embodiments of this application, the classification decision may at this point assign the text one or more domains, i.e., it may decide that the text belongs to domain 1, to domain 2, or to both domain 1 and domain 2.
For another example, if the sub-domain classifiers in priority 1 do not determine the domain of the text but, after the sub-domain classifiers in priority 2 process it, the text is determined to belong to domain 1, the classification decision determines the domain by summarizing the recognition results of the classifiers in priorities 1 and 2: priority 1 yielded no domain while priority 2 yielded domain 1, so the final domain recognition result is that the text belongs to domain 1.
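The two examples above can be sketched as a summarizing function over per-priority results: the first priority that recognized anything wins, duplicates are collapsed, and several domains may be kept. The data shapes are assumed for illustration.

```python
# A minimal sketch of the summarizing (classification) decision over
# the results returned by each priority's sub-domain classifiers.

def classification_decision(per_priority_results):
    """per_priority_results: list (highest priority first) of lists of
    domain names (or None) returned by that priority's classifiers.
    Returns the domains from the first priority that recognized any;
    duplicates are collapsed, and several domains may be kept."""
    for results in per_priority_results:
        domains = sorted(set(r for r in results if r))
        if domains:
            return domains
    return []        # no domain exists for this text

# Priority 1 found nothing; priority 2 found domain 1 twice.
print(classification_decision([[None, None], ["domain1", "domain1", None]]))
# Priority 1 found two different domains; both may be kept.
print(classification_decision([["domain1", "domain2"], []]))
```

Whether one domain or several are ultimately forwarded to dialog engines is a policy choice; as the text notes, the decision may assign the text to more than one domain.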
In the embodiments of this application, an instance may be created during the phone's recognition of a voice instruction; the instance may be a pending task, namely the task of performing domain recognition on the text converted from the voice instruction. Within the same priority of the classifier layer, several sub-domain classifiers can process the same instance at the same time — i.e., the phone executes several tasks simultaneously — to achieve domain recognition of the text.
The data loader is used to acquire the data of the libraries required by the algorithm layer, the models of the sub-domain classifiers in the classifier layer, and configuration information from the phone itself, the network side, or third-party equipment such as a server. A sub-domain classifier is the classifier corresponding to a given domain; configuration information includes, but is not limited to, the initialization parameters of the models.
In addition, the control layer is the layer of the system that interacts with the rest of the phone: it can obtain from the phone the text produced by speech recognition and, after the system has processed the text, feed the domain recognition result — i.e., the classification result — back to the phone.
It can be seen that the control layer is responsible for the external business interaction interface, the loading of initialization data and models, the scheduling of domain classification tasks, the distribution of classification tasks to the sub-domain classifiers, and the final summarizing decision over all returned classification results.
In one illustrative example, the classifier layer includes several priorities, such as priority 1, priority 2, and priority 3, where priority 1 is higher than priority 2, which is higher than priority 3. Within each priority, one or more classifier instances — i.e., sub-domain classifiers — may be set, such as classifier instances 11, 12, and 13 in priority 1.
The classifier layer implements the classification of the text. In the actual classification process, it supports multi-level, multi-instance task classification: as described above, it comprises classifier groups of several priorities, and within each group there are several parallel sub-domain classifiers that can run simultaneously, so that the summarizing decision is realized during the domain classification of the text.
A single sub-domain classifier includes rule, pattern, named entity recognition (NER), and prediction parts to implement sub-domain feature extraction and domain recognition. It should be noted that the same text may have the same sub-domain features in different sub-domain classifiers of the same priority and may still obtain different domain recognition results.
Sub-domain features include, but are not limited to, keywords in the text; that is, in different domains the same keyword may carry the same or different meanings, and in the embodiments of this application keywords can influence the domain recognition result. A domain recognition result is the sub-domain classifier's preliminary prediction, based on its processing of the text, of the domain the text may belong to. For example, after two sub-domain classifiers of the same priority process the same text, one determines that the text belongs to domain 1 and the other that it belongs to domain 2; the two classifiers thus obtain different domain recognition results, namely "the text belongs to domain 1" and "the text belongs to domain 2".
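A single sub-domain classifier with decisive rules and probability-boosting patterns, as described above, can be sketched as follows. The specific rules, patterns, scores, and threshold are invented for illustration; the patent does not prescribe them.

```python
# A minimal sketch of one sub-domain classifier: a matching rule is
# decisive and returns the domain directly, while a matching pattern
# only raises the predicted probability that the text belongs to the
# domain. Rules, patterns, and scores are hypothetical.

import re

class SubDomainClassifier:
    def __init__(self, domain, rules, patterns, threshold=0.5):
        self.domain = domain
        self.rules = [re.compile(r) for r in rules]        # decisive
        self.patterns = [re.compile(p) for p in patterns]  # boosting
        self.threshold = threshold

    def classify(self, text):
        """Return the domain name, or None if the text is rejected."""
        if any(r.search(text) for r in self.rules):
            return self.domain                  # rule match decides
        # Prediction part: each pattern hit adds to a base score.
        score = 0.2 + 0.3 * sum(bool(p.search(text)) for p in self.patterns)
        return self.domain if score >= self.threshold else None

clf = SubDomainClassifier(
    "setting",
    rules=[r"\bturn (on|off) wi-fi\b"],
    patterns=[r"\bbluetooth\b", r"\bbrightness\b"],
)
print(clf.classify("please turn on wi-fi"))        # rule: decisive
print(clf.classify("raise the brightness a bit"))  # pattern boost
print(clf.classify("what's the weather"))          # rejected
```

The NER part of a real classifier would feed entities (e.g., place names) into the prediction score; it is omitted here to keep the sketch minimal.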
In addition, among the classifier groups of different priorities, a serial relationship exists between the groups of two adjacent priorities. For priority 1, the higher priority in the classifier layer: if a sub-domain classifier in priority 1 obtains a valid domain recognition result, that result can be fed back to the phone through the control layer; if not, the text can be passed to the sub-domain classifiers of the serially connected priority 2 for processing, and so on, until a valid domain recognition result is obtained. If, after every priority in the classifier layer has been traversed, the text still has no domain recognition result, the classification result that no domain recognition result was obtained can be fed back to the phone. A valid domain recognition result means that, within some priority, the domain of the text can be determined; the determined domain is then the valid result.
It should be noted that the embodiments of this application do not limit the number of priorities in the classifier layer or the number of sub-domain classifiers within the same priority. The domain corresponding to each sub-domain classifier may be predefined. Moreover, during subsequent use of the system, the domains corresponding to the sub-domain classifiers can be adjusted; adjustments include, but are not limited to, changing a sub-domain classifier's priority, changing its corresponding domain, and increasing or decreasing the number of sub-domain classifiers — for example, moving a sub-domain classifier from one priority to another, or exchanging sub-domain classifiers located in different priorities.
In the actual configuration process, domains with higher recognition accuracy can be assigned to high-priority sub-domain classifiers, and better-performing models can likewise be assigned to high-priority classifiers. For a text whose domain is to be recognized, the higher recognition accuracy of the high-priority sub-domain classifiers and the better performance of their models allow the text to be recognized with greater accuracy and timeliness. When the domain of the text is recognized in the highest-priority sub-domain classifiers, the system can return the domain recognition result to the phone. That is, if the high-priority sub-domain classifiers perform domain recognition without obtaining a valid result, the text can be passed to the next level of sub-domain classifiers, until a valid result is obtained or the text has been processed by every level. A valid domain recognition result is the domain the system has determined for the text; the highest-priority sub-domain classifiers are those in priority 1 of fig. 4, namely sub-domain classifiers 11, 12, and 13. At the classifier layer, the text may be recognized sequentially, following the priorities of the sub-domain classifier groups from high to low; of course, once the domain of the text is recognized in some group, the domain recognition process can end.
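One way to realize this configuration step is to group sub-domain classifiers into priorities by their measured recognition accuracy, so the most accurate classifiers are consulted first. The accuracy figures and thresholds below are hypothetical.

```python
# A minimal sketch of grouping sub-domain classifiers into priority
# groups by recognition accuracy. Accuracies and thresholds are
# hypothetical illustrations.

def group_by_accuracy(classifiers, thresholds=(0.95, 0.85)):
    """classifiers: dict mapping domain name -> measured accuracy.
    Returns a list of priority groups (highest first) of domain
    names, split at the given accuracy thresholds."""
    groups = [[] for _ in range(len(thresholds) + 1)]
    for domain, acc in classifiers.items():
        for i, t in enumerate(thresholds):
            if acc >= t:
                groups[i].append(domain)
                break
        else:
            groups[-1].append(domain)    # below every threshold
    return [sorted(g) for g in groups]

accuracies = {"setting": 0.97, "gallery": 0.90, "weather": 0.96,
              "encyclopedia": 0.70}
print(group_by_accuracy(accuracies))
```

As noted above, this grouping can be revised later, e.g., moving a classifier between priorities as its accuracy changes.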
In one illustrative example, the algorithm layer provides algorithms and models. The models here are databases such as the rule library, the named entity (NE) library, and the feature library. The algorithms provided by the algorithm layer may also take the form of a database, such as an algorithm model library containing several algorithms.
It should be noted that, before the above algorithms are invoked, the control layer's data loader needs to load the content related to the algorithms into the system, so that the sub-domain classifiers of the classifier layer can invoke them flexibly.
Fig. 5 is a schematic diagram of an implementation flow of text domain recognition using the system shown in fig. 4.
After the phone inputs the text into the system of fig. 4, the system first performs fast full-precision matching on the text at the control layer. If a domain can be determined for the text, that domain is used directly as the recognition result; if not, the text can be processed further by the classifier layer.
The further processing of the text may be implemented by performing domain recognition on the text sequentially, following the priority groups of the classifier layer's sub-domain classifiers from high to low. During recognition, no matter in which priority group the domain of the text is identified, as long as the corresponding domain is recognized it is fed back to the phone as the domain recognition result, and the text is not submitted to the next priority group.
As shown in fig. 5, the system performs classification task scheduling for priority 1: it invokes sub-domain classifiers 11, 12, and 13 to perform parallel domain recognition on the text input into the system. Parallel domain recognition means that classifiers 11, 12, and 13 perform domain recognition on the text at the same time, or do so in a certain sequence; each classifier then outputs a domain recognition result, and the classification decision corresponding to priority 1 combines the three results to determine the result fed back to the phone, or else inputs the text into the next priority group.
When the text is input into the next priority group, the system continues classification task scheduling for priority 2, invoking sub-domain classifiers 21, 22, and 23 to perform parallel domain recognition; this can be implemented with reference to the preceding paragraph. Likewise, after finishing domain recognition for that group, the system can input the text into the sub-domain classifiers of priority 3, and any valid domain recognition result obtained can be fed back directly to the phone. The final recognition outcome is one of the following: the domain obtained through the control layer's full-precision matching of the text; the domain obtained through one or more priority groups of the classifier layer; or, after the text has passed through the control layer and every priority of the classifier layer, the result that no domain was obtained.
In the embodiments of this application, the system's domain recognition of the text ends when a valid domain recognition result is obtained, or when every priority group in the classifier layer has performed domain recognition on the text without obtaining one.
When the sub-domain classifiers of the same priority group perform domain recognition on the text simultaneously, several classifiers operate together within the same time period, which effectively saves the time the discrimination process occupies. When the classifiers of the same priority group recognize the text in a certain sequence instead, only one sub-domain classifier runs during any period, which guarantees that the system occupies fewer resources for a single classifier's operation during that period and that enough resources remain on the phone for other systems or programs to call.
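The two execution strategies just described can be sketched side by side: simultaneous execution via a thread pool (saving wall-clock time) versus sequential execution (bounding peak resource use to one classifier at a time). The classifiers are hypothetical.

```python
# A minimal sketch of parallel vs. sequential execution of the
# sub-domain classifiers within one priority group.

from concurrent.futures import ThreadPoolExecutor

def weather_clf(text):
    return "weather" if "rain" in text else None

def setting_clf(text):
    return "setting" if "wi-fi" in text else None

CLASSIFIERS = [weather_clf, setting_clf]

def run_parallel(text):
    """All classifiers of the group run in the same time period."""
    with ThreadPoolExecutor(max_workers=len(CLASSIFIERS)) as pool:
        results = pool.map(lambda clf: clf(text), CLASSIFIERS)
    return [r for r in results if r]

def run_sequential(text):
    """One classifier at a time, limiting peak resource use."""
    return [r for clf in CLASSIFIERS if (r := clf(text))]

print(run_parallel("will it rain today"))
print(run_sequential("turn on wi-fi"))
```

Both strategies return the same recognition results; the choice trades latency against the resources left for other programs on the phone, as described above.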
Referring to the system of fig. 4 and the method flow of fig. 5, it can be seen that, compared with the prior-art scheme of implementing speech recognition in the cloud, the system provided by the embodiments of this application offers better extensibility, higher flexibility, higher accuracy, and finer granularity.
Better extensibility means the system can support the arbitrary addition of new verticals in the future without rebuilding the existing model, i.e., the system provided by the embodiments of this application. The verticals are the categories of the different domains involved in the embodiments, such as setting, do-not-disturb, gallery, translation, stock, weather, calculator, and encyclopedia. That is, during subsequent use, sub-domain classifiers corresponding to other domains can be added to the classifier layer according to the needs of different application scenarios.
Higher flexibility means that the priority groups can be adjusted flexibly according to the specifics of current and future verticals, for example by increasing or decreasing the number of sub-domain classifiers within a single priority group or exchanging sub-domain classifiers between groups, without limitation here. This ensures that the classifier layer obtains a relatively accurate domain recognition result after the summarizing decision.
Higher accuracy means that a single sub-domain classifier can process the text with analysis and computation specific to the features of its own domain, for example the handling of numbers and stop words, the choice of bi-grams and tri-grams, and the range and manner of feature extraction. Because different sub-domain classifiers may adopt the same or different processing approaches, each is more targeted, so the accuracy is relatively high.
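The per-domain feature extraction mentioned above — stop-word removal followed by bi-gram/tri-gram construction — can be sketched as follows. The stop-word list is a hypothetical illustration; each sub-domain classifier could use its own.

```python
# A minimal sketch of n-gram feature extraction with stop-word
# removal, as one sub-domain classifier might perform it.

STOP_WORDS = {"the", "a", "to", "of", "please"}   # hypothetical list

def ngram_features(text, ns=(2, 3)):
    """Return bi-gram and tri-gram features after stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    feats = []
    for n in ns:
        feats += [" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return feats

print(ngram_features("please turn on the wi-fi"))
# → ['turn on', 'on wi-fi', 'turn on wi-fi']
```

Another classifier might keep numbers, use a different stop-word list, or restrict the n-gram range — this per-domain tuning is exactly what the paragraph above attributes the accuracy gain to.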
Finer granularity means that training data can be screened, and each sub-domain classifier trained and optimized in a targeted manner, so that the sub-domain classifier can refine the domain recognition process of the text and achieve more accurate domain recognition.
The process of domain recognition of text by the above-described system is described below in connection with illustrative examples.
In the embodiment of the present application, the domains corresponding to the sub-domain classifiers in the classifier layer can be preconfigured. For example, in descending order of the domain recognition accuracy of the sub-domain classifiers, sub-domain classifiers with higher recognition accuracy may be placed in a higher-priority group, such as priority 1, and sub-domain classifiers with lower recognition accuracy in a lower-priority group, such as priority 3. In one exemplary implementation, the mobile phone may place its own vertical classification tasks, which have the highest classification accuracy, in priority 1; the vertical classification tasks that interface with applications in priority 2; and the vertical tasks that are most difficult to recognize in priority 3.
After a task interfacing with an application is processed by the sub-domain classifiers in its priority group, whichever sub-domain classifier in the group produces the domain recognition result, the processing of the voice instruction is affected little, if at all. Since the sub-domain classifiers in priority 2 each correspond to an application, texts of different domains under the same application are typically processed by the dialog engine corresponding to that application. That is, whichever of the application's domains the text belongs to, it is ultimately processed by the same dialog engine. Therefore, in one implementation of the embodiment of the present application, the vertical classification tasks interfacing with applications can be regarded as tasks with a low requirement on domain recognition accuracy: no matter what the final domain recognition result is, as long as a valid domain recognition result is produced in priority 2, the text is ultimately submitted to the same dialog engine for processing, and the processing result is not affected.
For example, priority 2 includes the sub-domain classifier 21, the sub-domain classifier 22 and the sub-domain classifier 23, where the sub-domain classifier 21 corresponds to the stock domain, the sub-domain classifier 22 corresponds to the translation domain, the sub-domain classifier 23 corresponds to the calculation domain, and stock, translation and calculation correspond to the same application, i.e., the same dialog engine. That is, in priority 2, whichever of the stock, translation and calculation domains the text is determined to belong to, the mobile phone will ultimately push the text to the same dialog engine for processing. It can be seen that, whatever domain within priority 2 the recognition result indicates, the result of the subsequent processing of the text by the dialog engine is not affected.
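As an illustrative sketch (the table and names below are assumptions for illustration, not part of the patent), the routing logic that makes the priority 2 result immaterial can be expressed as a domain-to-engine table in which all three domains of one application share a single dialog engine:

```python
# Hypothetical domain -> dialog-engine table: stock, translation and
# calculation belong to the same application, hence the same engine.
DIALOG_ENGINE_FOR_DOMAIN = {
    "stock": "app_engine",
    "translation": "app_engine",
    "calculation": "app_engine",
}

def route(domain: str) -> str:
    """Return the dialog engine that will process text of this domain."""
    return DIALOG_ENGINE_FOR_DOMAIN[domain]

# Whichever of the three domains priority 2 decides on, the text reaches
# the same dialog engine, so the processing result is unaffected.
assert {route(d) for d in ("stock", "translation", "calculation")} == {"app_engine"}
```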
It should be noted that the domains corresponding to the sub-domain classifiers in priority 2 may also correspond to two or three dialog engines; in that case, the domains corresponding to several of the sub-domain classifiers still correspond to the same dialog engine.
A vertical classification task is executed by the sub-domain classifier corresponding to its domain. The sub-domain classifiers corresponding to the mobile phone's own vertical classification tasks may include, but are not limited to, sub-domain classifiers corresponding to the phone's built-in functions, for example those for the settings, do-not-disturb and gallery domains. The sub-domain classifiers corresponding to third-party vertical classification tasks may include, but are not limited to, those corresponding to applications installed on the phone, or to programs that can be invoked directly without downloading, such as applets, for example the sub-domain classifiers for the stock, translation, calculation and weather domains. The sub-domain classifiers corresponding to the most difficult-to-recognize vertical tasks may include, but are not limited to, those corresponding to domains whose recognition result is hard to determine from keywords, for example a domain with a search function such as encyclopedia.
It follows that in one exemplary implementation, the distribution of the individual sub-domain classifiers in the classifier layer is as follows:
priority 1: setting sub-domain classifiers corresponding to the non-disturbing gallery domains respectively;
priority 2: sub-domain classifiers corresponding to the stock, translation, calculation and weather domains respectively;
priority 3: sub-domain classifiers corresponding to encyclopedia domains.
For example, when the text input into the system is "I want to set do-not-disturb", the control layer does not obtain a valid domain recognition result through fast full-precision matching of the text. When the text is processed by the classifier layer, a valid domain recognition result is obtained through parallel processing by the sub-domain classifiers in priority 1, namely that the domain corresponding to the text is do-not-disturb. The system then feeds the obtained domain recognition result back to the mobile phone. The expression of the above example is as follows:
Text corresponding to the voice input by the user: I want to set do-not-disturb
The process is as follows: [I want to set do-not-disturb] -> <nodisturb, priority1>
A valid domain recognition result is obtained from the sub-domain classifiers in priority1 and is returned directly to the mobile phone.
Domain recognition result: [nodisturb], returned from <priority1>
The above domain recognition process involves the processing of the control layer and of each sub-domain classifier in priority 1 of the classifier layer.
For another example, when the text input to the system is "look at the stock market", the control layer does not obtain a valid domain recognition result through fast full-precision matching of the text. When the text is processed by the classifier layer, the domain recognition result obtained through parallel processing by the sub-domain classifiers in priority 1 is other. The text is then passed to the sub-domain classifiers of the next priority for processing: through parallel processing by the sub-domain classifiers in priority 2, a valid domain recognition result is obtained, namely that the domain corresponding to the text is stock. The system then feeds the obtained domain recognition result back to the mobile phone. The expression of the above example is as follows:
Text corresponding to the voice input by the user: look at the stock market
The process is as follows: [look at the stock market] -> <other, priority1>
No valid domain recognition result is obtained from the sub-domain classifiers in priority1 (the result obtained is other), so the text is handed to the sub-domain classifiers in priority2 for processing.
[look at the stock market] -> <stock, priority2>
A valid domain recognition result is obtained from the sub-domain classifiers in priority2 and is returned directly to the mobile phone.
Domain recognition result: [stock], returned from <priority2>
The above domain recognition process involves the processing of each sub-domain classifier in priority 1 and priority 2 of the classifier layer.
In the embodiment of the present application, if a sub-domain classifier corresponding to encyclopedia were included in priority 2, the domain recognition result obtained from the sub-domain classifiers in priority 2 might include both stock and encyclopedia, or only one of them, and the domain recognition result returned to the mobile phone would then have a larger error probability. It follows that in the classifier layer, the priority grouping of the sub-domain classifiers is important. For domains that easily cause ambiguity or are difficult to distinguish, the corresponding sub-domain classifier can be placed into a lower-priority group. In this way, once a high-priority group obtains a valid domain recognition result, the text does not need to be input to the low-priority groups for domain recognition, which reduces the recognition pressure on the lower priorities.
For another example, when the text input to the system is "query Wuliangye", the control layer does not obtain a valid domain recognition result through fast full-precision matching of the text. When the text is processed by the classifier layer, the domain recognition result obtained through parallel processing by the sub-domain classifiers in priority 1 is other. The text is handed to the sub-domain classifiers of the next priority, i.e., processed in parallel by the sub-domain classifiers in priority 2, and the obtained domain recognition result is still other. The text is then handed to the sub-domain classifiers of the next priority, and through parallel processing by the sub-domain classifiers in priority 3 a valid domain recognition result is obtained, namely that the domain corresponding to the text is encyclopedia. The system then feeds the obtained domain recognition result back to the mobile phone. The expression of the above example is as follows:
Text corresponding to the voice input by the user: query Wuliangye
The process is as follows: [query Wuliangye] -> <other, priority1>
No valid domain recognition result is obtained from the sub-domain classifiers in priority1 (the result obtained is other), so the text is handed to the sub-domain classifiers in priority2 for processing.
[query Wuliangye] -> <other, priority2>
No valid domain recognition result is obtained from the sub-domain classifiers in priority2 (the result obtained is other), so the text is handed to the sub-domain classifiers in priority3 for processing.
[query Wuliangye] -> <baike, priority3>
A valid domain recognition result is obtained from the sub-domain classifiers in priority3 and is returned directly to the mobile phone.
Domain recognition result: [baike], returned from <priority3>
The above domain recognition process involves the processing of each sub-domain classifier in priority 1, priority 2 and priority 3 of the classifier layer. In addition, this text is ambiguous, and there is a certain probability of it being recognized as stock, as encyclopedia, or as other. In the embodiment of the present application, the domain that texts are most easily recognized as is placed at the lowest priority of the classifier layer, which effectively reduces conflicts between the priority groups and relieves the recognition pressure on the higher-priority sub-domain classifiers.
In the above implementation process, the mobile phone can make full use of local user data and can effectively perform domain recognition without interacting with the cloud. Local user data refers to data stored locally on the mobile phone, for example in its memory, including but not limited to the contents of the various libraries involved in the system. The mobile phone thus saves the time consumed by data interaction with the cloud; moreover, during domain recognition, the multiple sub-domain classifiers of the same priority can complete their recognition operations simultaneously, effectively saving the time consumed by the domain recognition process.
In the embodiment of the present application, the priorities of the classifier layer can be divided according to the characteristics of the different domain categories and to the accuracy and performance of the model corresponding to each sub-domain classifier. For common or fixed phrasings that easily cause ambiguity, full-precision text matching can be configured at the control layer, which effectively improves the efficiency of domain recognition and saves the time occupied by the recognition process. In addition, texts can successively enter the multi-domain parallel recognition of the sub-domain classifiers of different priorities in descending order of priority, further improving the processing efficiency of the domain recognition process and saving processing time. The above priority division also makes effective use of sub-domain classifiers with poorer classification performance, namely by placing them into groups with lower priority.
In the above system, the recognition capability of a sub-domain classifier affects the domain recognition result, and the training of the sub-domain classifier affects its recognition capability; the training of the sub-domain classifier is therefore particularly important.
As shown in fig. 6, the embodiment of the present application provides an exemplary method for training a sub-domain classifier for the case where the domain to which the text belongs is known. The method flow includes S201 to S208.
S201, inputting a text.
S202, screening the text through rules.
In an embodiment of the present application, the rule may be a pattern of the form [^(search|look|tell|open).{1,12}(stock)$]. Here, "^" serves as the start anchor of the rule and indicates that the text begins with the start keyword "search", "look", "tell" or "open", followed at an interval of 1 to 12 characters by the end keyword "stock"; "$" serves as the end anchor of the rule and indicates that the text ends with "stock".
The start keyword means that the first word appearing in the text is "search", "look", "tell" or "open"; the end keyword means that the last word in the text is "stock".
It should be noted that the start anchor and the end anchor are optional in the pattern and are not limiting on the embodiment of the present application. For example, the rule may instead be a pattern of the form [(search|look|tell|open).{1,12}(stock)]. This pattern matches texts in which "search", "look", "tell" or "open" appears as the start keyword, followed at an interval of 1 to 12 characters by the end keyword "stock". In this case, the first word of the text is not necessarily "search", "look", "tell" or "open"; it suffices that one of them appears somewhere in the text. Likewise, "stock" appears 1 to 12 characters after the start keyword, but is not necessarily the last word of the text.
That is, a rule may include the start anchor, the end anchor, or both, which is not limited herein.
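The two rule forms can be tried out with ordinary regular expressions. This is an illustrative sketch: the English keywords and the `.{1,12}` gap merely stand in for the pattern elements described above.

```python
import re

# Anchored rule: the text must start with a start keyword and end with "stock".
anchored = re.compile(r"^(search|look|tell|open).{1,12}(stock)$")
# Unanchored rule: the keywords need only appear somewhere in the text.
unanchored = re.compile(r"(search|look|tell|open).{1,12}(stock)")

assert anchored.search("search for the stock")               # matches both forms
assert not anchored.search("please search for the stocks")   # fails start and end anchors
assert unanchored.search("please search for the stocks")     # keywords merely present
```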
S203, returning a domain recognition result when the text meets the rule.
With reference to the above description, for a text that matches the above rule, the domain to which the text belongs can be determined directly, so that the domain recognition result is determined and returned.
S204, when the text does not meet the rule, performing named entity recognition (NER) on the text and completing common feature replacement.
A common feature refers to content that affects the value calculated for a text but whose presence does not affect the domain to which the text belongs. In one implementation of the embodiment of the present application, common features include, but are not limited to, terms such as times and places, and may be preset. In the embodiment of the present application, common features may be replaced by symbols or the like, which is not limited herein.
In one implementation of the embodiment of the present application, NER is performed on the text to identify words such as times and places; the identified content is treated as common features, which are then replaced with preset symbols or the like.
S205, extracting features from the text after replacement.
Feature extraction refers to extracting words from the replaced text by the bi-gram method, the tri-gram method, and the like. For example, extracting words from the replaced text by the bi-gram method yields multiple features each consisting of two characters, or of one character and one symbol, or of two symbols, and so on.
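The extraction step can be sketched as a sliding window over the replaced text. The tokens below are illustrative; a unit may be a character, a word, or a placeholder symbol produced by the NER replacement.

```python
# Sketch of n-gram feature extraction: each feature is a window of n adjacent
# units of the replaced text (bi-grams by default).
def ngrams(units, n=2):
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]

units = ["search", "#", "photo"]          # "#" replaced a place name via NER
assert ngrams(units) == [("search", "#"), ("#", "photo")]
assert ngrams(["same", "flower", "sequence"], n=3) == [("same", "flower", "sequence")]
```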
S206, calculating the weight of each feature.
It should be noted that the manner of calculating weights from the features may follow existing implementations such as the bi-gram and tri-gram methods.
For example, taking the bi-gram method, the numerical values corresponding to the features obtained by splitting are input into a model, and through calculation by an algorithm such as linear regression (Linear Regression, LR) the model outputs one weight per feature. The numerical values input to the model may differ for different features, and the specific setting is not limited herein. In the embodiment of the present application, the model calculation may follow algorithms provided in the prior art, such as the LR algorithm described above, which is not described again herein.
S207, calculating a value corresponding to the text according to the weight.
The value of the replaced text, i.e., the value corresponding to the input text, is calculated from the weights of the features. For example, the weights of all features in the text may be summed to obtain the value corresponding to the text, or a weighted sum of the feature weights may be taken; the specific calculation method is not limited herein.
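The summation variant can be sketched as a lookup into the corpus feature library followed by a sum. The weights below are the ones given for training sample 1 later in this description; the feature labels follow this translation and are otherwise illustrative.

```python
# Sketch of S207: the text's value is the sum of its features' weights, looked
# up in the corpus feature library (unknown features contribute a default 0).
WEIGHTS = {
    "same flower": 0.33474357,
    "flower sequence": 0.23474357,
    "stranding": 0.30918131,
    "stock market": 1.57149447,
}

def text_value(features, weights, default=0.0):
    return sum(weights.get(f, default) for f in features)

value = text_value(WEIGHTS.keys(), WEIGHTS)
assert abs(value - 2.45016292) < 1e-8   # matches training sample 1 below
```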
S208, adjusting the sub-domain classifier according to the calculated value and the known domain identification result.
Since the above-mentioned S201 to S208 train the sub-domain classifier according to the text of the known domain, the sub-domain classifier can be adjusted according to the recognition result obtained by the sub-domain classifier and the domain to which the text actually belongs. The manner of adjusting the sub-domain classifier includes, but is not limited to, adjusting positive and negative samples in the sub-domain classifier. It should be noted that, adjusting positive and negative samples affects the weights of the features, and ultimately affects the calculated values corresponding to the text, thereby affecting the domain recognition result.
After the adjustment is completed, the mobile phone can continue to use the same text and process it again with the same sub-domain classifier until a correct domain recognition result is obtained. That is, during the training of the sub-domain classifier, the above S201 to S208 are repeated until the training goal is achieved.
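The repeat-until-correct loop of S201 to S208 can be sketched as follows. The toy classifier and its crude weight nudge are assumptions standing in for the real model and the positive/negative-sample adjustment of S208.

```python
# Sketch of the training loop: recognize, and if wrong, adjust and retry.
class ToyClassifier:
    def __init__(self, threshold=1.5):
        self.weights = {}          # corpus feature library: feature -> weight
        self.threshold = threshold

    def recognize(self, feats):
        value = sum(self.weights.get(f, 0.0) for f in feats)
        return "stock" if value >= self.threshold else "other"

    def adjust_samples(self, feats, true_domain):
        # Crude stand-in for S208: nudge feature weights toward the true label.
        delta = 0.5 if true_domain == "stock" else -0.5
        for f in feats:
            self.weights[f] = self.weights.get(f, 0.0) + delta

def train_until_correct(clf, feats, true_domain, max_rounds=10):
    for _ in range(max_rounds):
        if clf.recognize(feats) == true_domain:
            return True
        clf.adjust_samples(feats, true_domain)   # S208
    return False

clf = ToyClassifier()
assert train_until_correct(clf, ["fry", "stock"], "stock")
```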
Fig. 7 is a flowchart of an exemplary training method for adjusting positive and negative samples of a sub-domain classifier according to an embodiment of the present application. Wherein the method flow includes S301 to S310.
S301, generating positive and negative samples of the sub-field classifier.
In the embodiment of the application, each sub-field classifier can have independent positive and negative samples, wherein the positive and negative samples comprise a positive training sample set and a negative training sample set. The samples in the positive training sample set are samples belonging to the corresponding field of the sub-field classifier, and the samples in the negative training sample set are samples not belonging to the corresponding field of the sub-field classifier.
S302, NER and rule extraction are carried out on the positive and negative samples.
For example, if the text content of a sample is "search for a photo of Tiananmen", NER identifies "Tiananmen", and rule extraction yields the pattern [^(search).{1,10}(photo)$]. Thus, "Tiananmen" obtained after NER can be used as a common feature, and [^(search).{1,10}(photo)$] can be used as a rule.
S303, completing the replacement of the common features.
In one implementation of the embodiment of the present application, place names such as "Tiananmen" may be predefined to be replaced with "#", so the text content after common feature replacement is "search # photo".
Here, S302 and S303 may refer to the descriptions of S202 to S205 above, and are not described herein.
It should be noted that, performing NER on positive and negative samples may be a precondition for rule extraction and common feature replacement. Namely, the NER identifies the place, time, sentence pattern and the like in the positive and negative samples, then takes the sentence pattern as a rule, takes the time, place and the like as common features, and completes the replacement between the common features and the symbols.
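The replacement step can be sketched minimally as follows. The place list and placeholder are illustrative assumptions; a real system would obtain the entities from an NER model rather than a hard-coded set.

```python
# Sketch of common feature replacement (S204/S303): entities found by NER are
# swapped for a preset placeholder symbol before feature extraction.
PLACES = {"Tiananmen"}          # assumed NER output category: place names

def replace_common_features(tokens, placeholder="#"):
    return [placeholder if t in PLACES else t for t in tokens]

replaced = replace_common_features(["search", "Tiananmen", "photo"])
assert " ".join(replaced) == "search # photo"
```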
S304, removing noise such as stop words.
In the embodiment of the present application, a stop word is a character, word or symbol that has no decisive effect on domain recognition but whose presence may affect the accuracy of the domain recognition result, such as punctuation marks like ";" and ",". These stop words are identified and ignored during domain recognition.
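The removal can be sketched as a simple filter before feature extraction. The stop list below is an illustrative assumption, not the patent's list.

```python
# Sketch of S304: drop stop words/symbols before extracting features.
STOP_WORDS = {";", ",", "the", "a", "please"}

def drop_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = drop_stop_words(["please", "look", "at", "the", "stock", "market", ";"])
assert cleaned == ["look", "at", "stock", "market"]
```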
S305, extracting features to generate a corpus feature library.
The corpus feature library is used for recording the corresponding relation between the features and weights obtained through calculation in the step S206.
S306, calculating a value corresponding to the text according to the weight.
S307, training the sub-domain classifier.
The specific training process may refer to the implementation process of S201 to S207, which will not be described herein.
S308, evaluating the influence of erroneous domain recognition results.
S309, modifying the positive and negative samples.
The purposes of S308 and S309 are similar to that of S208; in an embodiment of the present application, modifying the positive and negative samples may serve as an exemplary implementation of S208.
The following describes, with an illustrative example, the process of adjusting the positive and negative samples during the training of the sub-domain classifiers of the above system.
In an exemplary implementation, taking the sub-domain classifier corresponding to the stock domain as an example, the training samples and the domain recognition results obtained after the samples are processed by the system are as follows.
The first-round processing results of the system on training sample 1 and training sample 2 are:
Training sample 1
Text corresponding to the voice instruction input by the user: same flower sequence stock market
Domain recognition result: [stock], returned from <priority2>
Training sample 2
Text corresponding to the voice instruction input by the user: fry stocks with same flower sequence
Domain recognition result: [stock], returned from <priority2>
In the embodiment of the present application, "same flower sequence" (Tonghuashun) is not only the name of a listed company but also the name of an application used for stock trading. In training sample 1, the user intends to query the stock of "same flower sequence"; in training sample 2, the user wants to open the application named "same flower sequence" and trade stocks. Thus, the domain recognition result obtained for training sample 1 is accurate, while the domain recognition result obtained for training sample 2 is erroneous.
Taking the bi-gram method as an example, the text obtained by converting the voice instruction input by the user is split by the bi-gram method into multiple features each consisting of two characters.
In training sample 1, the features and the weight corresponding to each feature are as follows:
same flower: 0.33474357
flower sequence: 0.23474357
stranding: 0.30918131
stock market: 1.57149447
Value of the text: 0.33474357 + 0.23474357 + 0.30918131 + 1.57149447 = 2.45016292
In training sample 2, the features and the weight corresponding to each feature are as follows:
same flower: 0.33474357
flower sequence: 0.23474357
fry: -0.34392488
fry stock: -0.34392488
stock: 1.99415611
Value of the text: 0.33474357 + 0.23474357 - 0.34392488 - 0.34392488 + 1.99415611 = 1.87579349
It should be noted that the weight of a feature may be positive, negative, or 0. In the embodiment of the present application, the larger the weight of a feature, the greater its contribution to recognizing the text as belonging to the sub-domain in question (the stock domain in this example).
In one implementation of the embodiment of the present application, 1.5 is taken as the threshold for the value of a text: when the sum of the weights of all features in the text is greater than or equal to 1.5, the text is confirmed to belong to the stock domain; when the sum is less than 1.5, the text is confirmed not to belong to the stock domain. Since every feature weight involved in training sample 1 is positive, the correct domain recognition result, namely the stock domain, is obtained from the weights. In training sample 2, the noise "fry" is present, but the absolute values of the negative weights of the features "fry" and "fry stock" are too small, so after the weights are summed training sample 2 still yields a value greater than the threshold, and the text is erroneously recognized as belonging to the stock domain.
To correct the misrecognition in the first round of processing, before performing the second round the system adjusts the positive and negative samples according to the result that training sample 1 was recognized correctly and training sample 2 was recognized incorrectly: content containing [same flower sequence] is deleted from the positive samples, and [same flower sequence] and [fry stock] are added to the negative samples.
In general, adding content to the positive samples increases the weights of the related features, and deleting content from the positive samples decreases them; similarly, adding content to the negative samples decreases the weights of the related features, and deleting content from the negative samples increases them.
For example, deleting the content containing [same flower sequence] from the positive samples reduces the weights corresponding to the features "same flower" and "flower sequence", and adding [same flower sequence] to the negative samples further reduces those weights.
However, as for adding [fry stock] to the negative samples: this lowers the weight corresponding to the feature "fry stock", while in one implementation of the embodiment of the present application the weight corresponding to the feature "stock" is not affected. The reason is that the number of samples containing [stock] among the positive and negative samples is large, so adding one more negative sample has little influence. For example, if twenty thousand positive samples and ten thousand negative samples contain [stock], adding a single negative sample of [fry stock] barely perturbs these large sample sets, so its influence on the weight corresponding to "stock" is almost zero and that weight does not change. Therefore, in one implementation of the embodiment of the present application, deleting the content containing [same flower sequence] from the positive samples and adding [same flower sequence] and [fry stock] to the negative samples reduces the weights corresponding to the features "same flower", "flower sequence" and "fry stock", without affecting the weight corresponding to "stock".
It should be noted that the foregoing is an exemplary implementation, and is not meant to limit embodiments of the present application.
After the first adjustment of the positive and negative samples, the features and weights in training sample 1 are as follows:
same flower: -0.34743574
flower sequence: -0.34743574
stranding: 0.30918131
stock market: 1.57149447
Value of the text: -0.34743574 - 0.34743574 + 0.30918131 + 1.57149447 = 1.1858043
After the first adjustment of the positive and negative samples, the features and weights in training sample 2 are as follows:
same flower: -0.34743574
flower sequence: -0.34743574
fry: -0.34392488
fry stock: -1.34392488
stock: 1.99415611
Value of the text: -0.34743574 - 0.34743574 - 0.34392488 - 1.34392488 + 1.99415611 = -0.38856513
After the first adjustment of the positive and negative samples, the change in the samples changes the weights of some or all of the features, which in turn affects the processing result to a certain extent. That is, the values of the texts in training sample 1 and training sample 2 are both less than 1.5, meaning that both training samples are recognized as not belonging to the stock domain. It should be noted that, for the same feature: the more positive samples it appears in, the larger its weight; the more negative samples it appears in, the smaller its weight; and when it appears in both positive and negative samples, its weight is weighted according to the numbers of positive and negative samples containing it.
After the first positive and negative sample adjustment, the second round of processing results of the system on the training samples 1 and 2 are as follows:
training sample 1
Text corresponding to voice instructions input by a user: common flower and common stock market
Domain identification result: [other], returned from <priority3>
Training sample 2
Text corresponding to voice instructions input by a user: fried stock with same flower
Domain identification result: [other], returned from <priority3>
The domain recognition result obtained for training sample 1 is wrong, while the one obtained for training sample 2 is correct. It should be noted that the correct domain recognition result corresponding to a training sample can be input together with the training sample, so that the mobile phone can automatically adjust the positive and negative samples by comparing the known correct domain with the output domain recognition result; alternatively, after the domain recognition result is output, whether it is correct can be judged manually, and the mobile phone is triggered to automatically adjust the positive and negative samples when the result is incorrect.
Thus, the system automatically adjusts the positive and negative samples again, i.e., performs a second adjustment. Based on the first adjustment, the system readjusts the [same flower sequence] content in the positive samples, for example by increasing the amount of content containing [same flower sequence] in the positive samples. In this way, the values of the weights corresponding to the features "same flower" and "flower sequence" can be effectively raised.
After the second positive and negative sample adjustment, in the training sample 1, each feature and the weight corresponding to each feature are as follows:
Same flower: -0.03474357
Flower sequence: -0.03474357
Stranding: 0.30918131
Stock market: 1.57149447
Value of text: -0.03474357-0.03474357+0.30918131+1.57149447= 1.81118864
After the second positive and negative sample adjustment, in the training sample 2, each feature and the weight corresponding to each feature are as follows:
Same flower: -0.03474357
Flower sequence: -0.03474357
Frying: -0.34392488
Strand frying: -1.34392488
Stock: 1.99415611
Value of text:
-0.03474357-0.03474357-0.34392488-1.34392488+1.99415611=0.23681921
after the second positive and negative sample adjustment, the third round of processing results of the system on the training samples 1 and 2 are as follows:
training sample 1
Text corresponding to voice input by the user: common flower and common stock market
Domain identification result: [stock], returned from <priority2>
Training sample 2
Text corresponding to voice input by the user: fried stock with same flower
Domain identification result: [other], returned from <priority3>
In the embodiment of the present application, the system adjusts the positive and negative samples in each round according to whether that round's domain recognition results are correct or wrong, until training sample 1 and training sample 2 both obtain correct domain recognition results. It follows that the greater the number of training samples, the higher the accuracy of the adjusted positive and negative sample sets.
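The adjust-until-correct loop described above can be sketched as follows; `train_fn`, `predict_fn`, and `adjust_fn` are hypothetical stand-ins for retraining the classifier, running one round of domain recognition, and adjusting the positive and negative sample sets.

```python
def tune_samples(train_fn, predict_fn, adjust_fn, eval_set, max_rounds=10):
    """eval_set: list of (text, correct_domain) pairs, e.g. the two
    training samples above with their known correct domains."""
    for round_no in range(1, max_rounds + 1):
        model = train_fn()  # retrain on the current positive/negative samples
        errors = [(t, d) for t, d in eval_set if predict_fn(model, t) != d]
        if not errors:  # every sample got the correct domain recognition result
            return model, round_no
        adjust_fn(errors)  # adjust the positive/negative samples and retry
    return model, max_rounds

# Toy usage: a "model" that flips to the stock domain once the sample
# adjustments (modelled here as a bias bump) have gone far enough.
state = {"bias": -1.0}
_, rounds = tune_samples(
    train_fn=lambda: dict(state),
    predict_fn=lambda m, t: "stock" if m["bias"] > 0 else "other",
    adjust_fn=lambda errs: state.update(bias=state["bias"] + 1.0),
    eval_set=[("same flower sequence stock market", "stock")],
)
```

In the toy run the loop needs three rounds before the recognition result is correct, mirroring the three rounds of processing shown for the training samples.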
In order to reduce the interference caused by stop words, numbers, place names, and similar content in the domain recognition process, in the embodiment of the present application the domain recognition result can be determined by recognizing sentence patterns, or the domain recognition process can be simplified by replacing interfering terms. This improves the accuracy of domain recognition and further reduces the time it occupies.
In an exemplary implementation, for sentence patterns that are difficult for the sub-domain classifiers to recognize, or that can strongly affect the domain recognition result, rules may be set in advance based on those sentence patterns for use in the full-precision text matching performed by the control layer.
For example, the text content obtained by speech recognition in examples 1 to 3 and the domain recognition results obtained by the system are as follows:
example 1:
text corresponding to voice instructions input by a user: querying the strand of the lower large carbon
Domain identification result: stock
Example 2:
text corresponding to voice instructions input by a user: querying stocks 600160 of lower Beijing harbor and Jiangcu CWB1
Domain identification result: stock
Example 3:
text corresponding to voice instructions input by a user: search for pictures shot in Beijing yesterday
Domain identification result: drawing library
For the text shown in example 1, the sentence pattern may be preset as "query ... stock", so that even when the user's input contains an error or the speech recognition omits something, as long as the text contains this sentence pattern the system can accurately recognize it and determine the domain of the text accordingly, thereby obtaining an accurate domain recognition result.
For example, the rule may be preset as [^(search/look/tell/open).{1,12}(stock)$], allowing the system to recognize and match the text quickly and feed the obtained domain recognition result back to the mobile phone. The meaning of the expression "[^(search/look/tell/open).{1,12}(stock)$]" may be found in the description above and is not repeated here.
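As a hedged reconstruction, such a preset rule reads naturally as an anchored regular expression; the verb list and the 1-12 character gap are taken from the example above, while the English tokens stand in for the original (machine-translated) terms.

```python
import re

# Hypothetical reconstruction of the preset rule: the text must start with
# one of the verbs, end with "stock", and have 1 to 12 arbitrary characters
# in between.
STOCK_RULE = re.compile(r"^(search|look|tell|open).{1,12}stock$")

def match_stock_rule(text):
    """Return 'stock' if the sentence pattern matches, else None."""
    return "stock" if STOCK_RULE.match(text) else None
```

When the rule matches, the control layer can feed back the domain directly without invoking the classifier layer.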
In one exemplary implementation, text that none of the rules can match still needs to be processed by the classifier layer.
Taking example 3 as an example, in an embodiment of the present application, "search ... picture" may be used as a pattern. For the gallery domain, the rules in the sub-domain classifier corresponding to the gallery domain may then include this pattern, meaning that when the sub-domain classifier recognizes the pattern while processing the text, it can feed back a valid domain recognition result, namely that the domain of the text is the gallery.
Taking example 2 as an example, common features may be defined in advance for the system to prevent inaccurate domain recognition caused by such features. When the system processes the text, a run of 6 consecutive digits can be replaced; for example, [600160] is replaced with @, so that the text becomes "query stock @ of lower Beijing harbor and Jiangcu CWB1". The system may then invoke NER to extract the NE information in the text as common features; for example, [Beijing harbor] is defined as a "common company name" entity and replaced with #, and [Jiangcu CWB1] is defined as an "online company name code" entity and replaced with @. The content of the text then becomes "query stock @ of lower # and @".
Similarly, times in the text may be replaced with $ and places with #; the content of the text in example 3 then becomes "search for $ picture taken at #".
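The replacement step can be sketched as follows; the placeholder symbols @, $, and # come from the examples above, while the digit pattern and the tiny time/place vocabularies are illustrative stand-ins for the NER step, not the patent's entity lists.

```python
import re

def replace_common_features(text):
    # Runs of 6 consecutive digits (e.g. stock codes) become @.
    text = re.sub(r"\d{6}", "@", text)
    # Times become $ and places become # (hypothetical NER stand-ins).
    text = re.sub(r"\b(yesterday|today|tomorrow)\b", "$", text)
    text = re.sub(r"\b(Beijing|Shanghai)\b", "#", text)
    return text
```

Applied to the example-3 text, "search for pictures shot in Beijing yesterday" becomes "search for pictures shot in # $", which is the form the sub-domain classifier then scores.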
After the above replacement process is completed, the respective features and the weights corresponding to each feature in example 2 are as follows:
querying: 0.1067646020633481
Polling one: -0.10021895439172483
The following steps: -0.215034710246020433
Lower #:0.1067646020633481
# and: null (null)
Here, null indicates that the feature "# and" has no effect on the domain recognition result, i.e., the weight of the feature is 0.
And @:0.12009207293891772
Ply: 0.304457783445201952
Stock: 1.1114948005328673
Ticket @:0.3067646020633481
After the above replacement process is completed, the respective features and the weights corresponding to each feature in example 3 are as follows:
searching: 0.3835541240544907
The following steps: -0.2517062504931636
The following $:0.14542119078470123
The method is characterized in that: 0.094333521958256
At #:0.19608161704432386
# pat: -0.006875871484002316
Beating: 0.5827998208565368
Is shown in the figure: 0.26154773801450293
Picture: 0.17497209951796067
+:1.4622835953886275
The feature "+" indicates that the replaced text matches the pattern defined in the sub-domain classifier, so when the value corresponding to the text is calculated, an additional weight is added for the pattern match, which improves the accuracy of domain recognition.
The value corresponding to the replaced text: 3.04241158392
It follows that the above replacement procedure replaces common features such as times and places; since a common feature often consists of at least two words, the number of feature-weight pairs is reduced once the replacement is complete. Especially when a text involves many common features, this replacement can effectively simplify the calculation performed by the sub-domain classifier and thus improve its efficiency. Moreover, the replacement effectively reduces the interference of common features with domain recognition.
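Putting the example-3 numbers together, the sub-domain classifier's decision after replacement reduces to one weighted sum plus the pattern-match bonus. The list below copies the weights printed above; the garbled n-gram feature names are kept only as comments, and the 1.5 threshold is the one used throughout the examples.

```python
# Weights from example 3 above; keys would be the original character n-gram
# features, which are garbled in translation, so they appear only as comments.
EXAMPLE3_WEIGHTS = [
    0.3835541240544907,    # "searching"
    -0.2517062504931636,
    0.14542119078470123,
    0.094333521958256,
    0.19608161704432386,
    -0.006875871484002316,
    0.5827998208565368,
    0.26154773801450293,
    0.17497209951796067,   # "picture"
    1.4622835953886275,    # "+" pattern-match bonus
]
THRESHOLD = 1.5

value = sum(EXAMPLE3_WEIGHTS)     # about 3.04
belongs_to_gallery = value > THRESHOLD
```

The pattern bonus alone nearly reaches the threshold, which is why matching a defined pattern makes the gallery classification so much more reliable.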
The following examples show, for several different domains, the content played back by voice or the result displayed in response to a voice instruction.
Text corresponding to voice instructions input by a user: the font is enlarged a little bit
Response: good, have been adjusted for you
The mobile phone carries out field recognition on the text with the content of 'enlarging the font' obtained after voice recognition, and the obtained field recognition result is that the text belongs to the setting field. And then the mobile phone sends the text to a dialogue engine corresponding to the setting field for processing. When the mobile phone gives a response to the user, the mobile phone can enlarge the font according to the requirement of the user.
Text corresponding to voice instructions input by a user: please help me set a do-not-disturb mode from two to three this afternoon, except for Old Wang
Response: do-not-disturb is turned on from 14:00 to 15:00, with an exception for Old Wang
The mobile phone performs domain recognition on the text "please help me set a do-not-disturb mode from two to three this afternoon, except for Old Wang" obtained after speech recognition; the domain recognition result is that the text belongs to the do-not-disturb domain. The mobile phone then sends the text to the dialog engine corresponding to the do-not-disturb domain for processing. When the mobile phone responds to the user, it sets the do-not-disturb period as the user requested while ensuring that the user is still alerted to calls from Old Wang during that period.
Text corresponding to voice instructions input by a user: picture with help me hundred degrees Fan Bingbing
Response: relevant photo jumping to hundred degree display Fan Bingbing
The mobile phone carries out field recognition on the text which is obtained after voice recognition and contains the picture of 'please help me hundred degrees under Fan Bingbing', and the obtained field recognition result is that the text belongs to the field of a gallery. And then the mobile phone sends the text to a dialogue engine corresponding to the gallery field for processing. It should be noted that, when the mobile phone gives a response to the user, the mobile phone completes the picture search according to the requirement of the user, i.e. the related photos are presented to the user through hundreds of degrees.
Text corresponding to voice instructions input by a user: what English speaking for chopsticks
Response: chopticks
The mobile phone carries out field recognition on the text which is obtained after voice recognition and contains what English is used for chopsticks, and the obtained field recognition result is that the text belongs to the translation field. And then the mobile phone sends the text to a dialogue engine corresponding to the translation field for processing. It should be noted that, when the mobile phone gives a response to the user, the mobile phone has completed the translation of the word "chopsticks" according to the user's requirement.
Text corresponding to voice instructions input by a user: how much weather is today
Response: shanghai today is rainy, 7-10 ℃, air quality index is 88, air can also be used, please remember to bring umbrella
The mobile phone carries out field recognition on a text with the content of what is the weather today obtained after voice recognition, and the obtained field recognition result is that the text belongs to the weather field. And then the mobile phone sends the text to a dialogue engine corresponding to the weather field for processing. It should be noted that, when the mobile phone gives a response to the user, the mobile phone has already determined the weather condition of the geographical location in combination with the geographical location where the user is currently located.
Text corresponding to voice instructions input by a user: stock of the myriad family
Response: vanke A now 39.42 yuan, drop width 0.86%, already had a rest market
The mobile phone carries out field recognition on the text with the content of stock in the universal family obtained after voice recognition, and the obtained field recognition result is that the text belongs to the stock field. And then the mobile phone sends the text to a dialogue engine corresponding to the stock domain for processing. It should be noted that, when the mobile phone gives a response to the user, the mobile phone has determined the stock condition that the user requests to know.
Text corresponding to voice instructions input by a user: to the 13 th power of 2
Response: the 13 th power of 2 is equal to 8192
The mobile phone carries out the field recognition on the text with the content of 'more than or equal to 13 times of 2' obtained after the voice recognition, and the obtained field recognition result is that the text belongs to the calculation field. And then the mobile phone sends the text to a dialogue engine corresponding to the computing field for processing. It should be noted that, when the mobile phone gives a response to the user, the mobile phone determines a calculation result expected by the user through a calculation mode.
Text corresponding to voice instructions input by a user: yao Ming height of
Response: yao Ming, 9.12 days 1980, is in the Shanghai city, the Suzhou city Wu Jiangou of Jiangsu province, the front Chinese professional basketball player, the mid-name of the judger and the general manager of the incumbent middle-office company. 4 months 1998, yao Ming entered the Wang Fei-licensed national team and began basketball careers. 2001 to obtain CBA conventional racing MVP,2002 to obtain CBA general champion and general resolution MVP, 3 times to select CBA backboard Wang Yi and cap king respectively, and 2 times to select CBA buckle king.
The mobile phone carries out field recognition on the text with the content of Yao Ming obtained after voice recognition, and the obtained field recognition result is that the text belongs to the encyclopedia field. The mobile phone can search the extracted keywords in encyclopedia and present the searched results to the user, and meanwhile, the mobile phone can selectively present the searched related contents to the user. It should be noted that, when the mobile phone gives a response to the user, the mobile phone has searched the height of Yao Ming and other relevant information.
In the above examples, the response modes include, but are not limited to, text prompts and voice prompts.
In the embodiments of the present application, the electronic device may be divided into functional modules according to the foregoing method embodiments; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module. It should be noted that the division of modules in the embodiments of the present application is schematic and merely a division by logical function; other division manners may be used in actual implementations.
Fig. 8 is a schematic structural diagram of a voice recognition apparatus of an electronic device involved in the above embodiments. The voice recognition apparatus 400 of the electronic device includes: a receiving module 401, a conversion module 402, a first domain identification module 403, a processing module 404, a second domain identification module 405, a control module 406, and a sub-domain classifier 407. The sub-domain classifier 407 includes a named entity recognition module 4071, a replacement module 4072, an extraction module 4073, a calculation module 4074, and a domain determination module 4075. The electronic device 400 includes at least one sub-domain classifier 407; the number is not limited here.
The receiving module 401 is configured to support the electronic device 400 in receiving a voice instruction entered by the user, i.e., the voice input shown in fig. 4. The conversion module 402 is configured to support the electronic device 400 in converting the voice instruction into text, for example by converting the input voice into text through speech recognition as shown in fig. 4. The first domain identification module 403 is configured to support the electronic device 400 in performing domain recognition on the text through at least two sub-domain classifiers to obtain a domain recognition result; for example, the sub-domain classifiers in each priority (i.e., each sub-domain classifier group) of the classifier layer shown in fig. 4 recognize the text, such as sub-domain classifier 11, sub-domain classifier 12, and sub-domain classifier 13 in priority 1 recognizing the text in parallel. The processing module 404 is configured to support the electronic device 400 in processing the text through the dialog engine corresponding to the domain to which the text belongs, determining the function that the electronic device needs to perform for the text, and supporting the other processes by which the electronic device 400 implements the techniques described herein. The second domain identification module 405 is configured to support the electronic device 400 in matching the text against pre-stored text; for example, as shown in fig. 4, the control layer matches the text with the pre-stored text at full precision. When the matching succeeds, the domain corresponding to the pre-stored text is determined to be the domain recognition result of the text; when the matching fails, the text is input to the classifier layer, and the first domain identification module 403 performs domain recognition on the text through at least two sub-domain classifiers to obtain a domain recognition result.
In one implementation of the embodiment of the present application, the first domain identification module includes N sub-domain classifier groups, where each group has a different priority and N is a positive integer greater than or equal to 2. At least one of the N sub-domain classifier groups includes at least two sub-domain classifiers, and each sub-domain classifier is used to determine whether the text belongs to the domain corresponding to that sub-domain classifier. The control module 406 is configured to support the electronic device 400 in controlling the sub-domain classifiers in the highest priority group of the N sub-domain classifier groups to perform domain recognition on the text; for example, as shown in fig. 4, the sub-domain classifiers in the highest priority group of the classifier layer, i.e., priority 1, perform domain recognition on the text. If a sub-domain classifier in the highest priority group identifies the domain to which the text belongs, that domain is taken as the domain recognition result. If no sub-domain classifier in the highest priority group recognizes the domain to which the text belongs, the sub-domain classifiers in the next priority group of the N sub-domain classifier groups perform domain recognition on the text, and so on until either the domain to which the text belongs is identified and taken as the domain recognition result, or the text has undergone domain recognition by all sub-domain classifiers in the N sub-domain classifier groups. For example, as shown in fig. 4, if no domain recognition result is obtained after the sub-domain classifiers in priority 1 process the text, the sub-domain classifiers in priority 2 perform domain recognition on it; and if no domain recognition result is obtained after the text has passed through all the sub-domain classifiers in priority 1, priority 2, and priority 3 of the classifier layer, the processing of the voice instruction ends.
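The priority cascade described above can be sketched with a hypothetical classifier interface, where each sub-domain classifier is a callable that returns a domain name, or None when it does not recognize the text.

```python
def recognize_domain(text, priority_groups):
    """priority_groups: list of classifier groups ordered from highest
    priority (priority 1) to lowest (priority N)."""
    for group in priority_groups:
        # All sub-domain classifiers in one group examine the text
        # (in parallel in the real system; sequentially in this sketch).
        results = [clf(text) for clf in group]
        hits = [r for r in results if r is not None]
        if hits:
            return hits  # one or more domain recognition results
    return None  # all N groups tried without recognizing a domain

# Toy usage with two priority groups (illustrative keyword classifiers).
stock_clf = lambda t: "stock" if "stock" in t else None
gallery_clf = lambda t: "gallery" if "picture" in t else None
weather_clf = lambda t: "weather" if "weather" in t else None
groups = [[stock_clf, gallery_clf], [weather_clf]]
```

Returning a list keeps room for the case above where two classifiers in one group both produce a result and the control module keeps one or both.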
The control module 406 is further configured to, when a first sub-domain classifier performs domain recognition on the text and obtains a first domain recognition result while a second sub-domain classifier performs domain recognition on the text and obtains a second domain recognition result, determine at least one of the first domain recognition result and the second domain recognition result as the domain recognition result, or determine that both are the domain recognition result. Taking priority 1 shown in fig. 4 as an example, when sub-domain classifier 11 obtains the first domain recognition result and sub-domain classifier 12 obtains the second domain recognition result, the control module 406 performs the above process. If sub-domain classifier 13 also obtains a third domain recognition result, the control module 406 determines that at least one of the first, second, and third domain recognition results is the domain recognition result of the text.
In the sub-domain classifier 407, the named entity recognition module 4071 is configured to support the electronic device 400 in performing NER on the text and determining the common features in the recognized content. The replacement module 4072 is configured to support the electronic device 400 in replacing the common features in the text according to preset rules. The extraction module 4073 is configured to support the electronic device 400 in performing feature extraction on the replaced text and determining the weight of each feature. The calculation module 4074 is configured to support the electronic device 400 in calculating the value of the text based on the weight of each feature. The domain determination module 4075 is configured to support the electronic device 400 in determining that the text belongs to the domain corresponding to the sub-domain classifier when the value of the text is greater than a threshold. The sub-domain classifier 407 may be any one of the sub-domain classifiers in the classifier layer shown in fig. 4.
In one implementation of an embodiment of the present application, the electronic device 400 may further include at least one of a storage module 408, a communication module 409, and a display module 410. Wherein the memory module 408 is used to support the electronic device 400 to store program codes and data of the electronic device; the communication module 409 may support data interaction between various modules in the electronic device 400 and/or support communication between the electronic device 400 and, for example, a server, other electronic devices, etc.; the display module 410 may support the electronic device 400 to present the processing result of the voice command to the user by text, graphics, or the like, or selectively present the voice recognition process to the user during the voice recognition process, which is not limited herein.
Wherein the receiving module 401 and the communication module 409 may be implemented as transceivers; the conversion module 402, the first domain identification module 403, the processing module 404, the second domain identification module 405, the control module 406, and the sub-domain classifier 407 may be implemented as a processor; the storage module 408 may be implemented as a memory; the display module 410 may be implemented as a display.
In one implementation of the embodiment of the present application, the processor may also be a controller, such as a CPU, a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor may also be a combination that performs computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The transceiver may also be implemented as a transceiver circuit or a communication interface.
As shown in fig. 9, the electronic device 50 may include: a processor 51, a transceiver 52, a memory 53, a display 54, and a bus 55. The transceiver 52, the memory 53, and the display 54 are optional components, i.e., the electronic device 50 may include one or more of them. The processor 51, transceiver 52, memory 53, and display 54 are interconnected by the bus 55; the bus 55 may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. Buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one thick line is shown in fig. 9, but this does not mean that there is only one bus or only one type of bus.
The steps of a method or algorithm described in connection with the present disclosure may be embodied in hardware, or in software instructions executed by a processor. The software instructions may be composed of corresponding software modules, which may be stored in random access memory (Random Access Memory, RAM), flash memory, read-only memory (ROM), erasable programmable read-only memory (Erasable Programmable ROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in the same device, or they may reside as discrete components in different devices.
Embodiments of the present application provide a readable storage medium comprising instructions. The instructions, when executed on an electronic device, cause the electronic device to perform the method described above.
Embodiments of the present application provide a computer program product comprising software code for performing the above-described method.
The foregoing detailed description of the embodiments of the present application has been presented for purposes of illustration and description, and it should be understood that the foregoing detailed description of the embodiments of the application is not intended to limit the scope of the application, but is intended to cover any modifications, equivalents, improvements, etc. that fall within the scope of the embodiments of the application.

Claims (17)

1. A method for voice recognition by an electronic device, the method comprising:
converting the received voice instruction into text;
performing domain recognition on the text in parallel through at least two sub-domain classifiers to obtain a domain recognition result, wherein the domain recognition result is used for representing the domain to which the text belongs;
and processing the text through a dialogue engine corresponding to the field to which the text belongs, and determining the function to be executed by the electronic equipment corresponding to the text.
2. The method of claim 1, wherein after said converting the received voice command to text, the method further comprises:
matching the text with a pre-stored text;
and when the text is successfully matched with the pre-stored text, determining the field corresponding to the pre-stored text as a field identification result of the text.
3. The method according to claim 2, wherein the text is subjected to domain recognition in parallel by at least two sub-domain classifiers to obtain a domain recognition result, specifically:
and when the text fails to match with the pre-stored text, performing domain identification on the text in parallel through at least two sub-domain classifiers to obtain a domain identification result.
4. A method according to any one of claims 1 to 3, wherein the electronic device comprises N sub-domain classifier groups, wherein each group has a different priority, N being a positive integer greater than or equal to 2;
the text is subjected to field recognition in parallel through at least two sub-field classifiers to obtain a field recognition result, which is specifically:
performing domain identification on the text through the sub-domain classifier in the highest priority group in the N sub-domain classifier groups;
If the sub-domain classifier in the highest priority group identifies the domain to which the text belongs, the sub-domain classifier in the highest priority group identifies the domain to which the text belongs as the domain identification result;
if the sub-domain classifier in the highest priority group does not recognize the domain to which the text belongs, performing domain recognition on the text by the sub-domain classifier in the next priority group in the N sub-domain classifier groups until:
identifying the domain to which the text belongs, and taking the identified domain as the domain identification result; or
The text is subjected to domain recognition through all sub-domain classifiers in the N sub-domain classifier groups;
at least two sub-domain classifiers are included in at least one of the N sub-domain classifier groups.
5. The method of claim 4, wherein the domain identification accuracy of the sub-domain classifiers in the low priority group is lower than the domain identification accuracy of the sub-domain classifiers in the high priority group in the N sub-domain classifier groups.
6. The method of claim 4 or 5, wherein at least one of the N sub-domain classifier groups comprises a first sub-domain classifier and a second sub-domain classifier, and the method further comprises:
when the first sub-domain classifier performs domain recognition on the text to obtain a first domain recognition result and the second sub-domain classifier performs domain recognition on the text to obtain a second domain recognition result,
determining at least one of the first domain recognition result and the second domain recognition result as the domain recognition result; or
determining both the first domain recognition result and the second domain recognition result as the domain recognition result.
7. The method of any one of claims 1 to 4, wherein at least one of the at least two sub-domain classifiers performing domain recognition on the text comprises:
performing named entity recognition (NER) on the text, and determining common features in the recognized content;
replacing the common features according to preset rules, wherein the preset rules comprise replacement content corresponding to common features of different categories;
extracting features from the text after replacement, and determining a weight for each feature;
calculating a value of the text according to the weight of each feature; and
when the value of the text is greater than a threshold, determining that the text belongs to the domain corresponding to the sub-domain classifier.
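The five steps of claim 7 amount to a weighted-feature scorer: NER normalizes common features (for example, person or city names) into category placeholders so that a single weight covers the whole category. The sketch below is a hedged illustration; the hard-coded entity table, the replacement rules, and the weights are hypothetical stand-ins for trained components.

```python
# Placeholder token per common-feature category (hypothetical preset rules).
REPLACE_RULES = {"PERSON": "<person>", "CITY": "<city>"}

def ner(text):
    # Stand-in NER: tags a couple of hard-coded entities by category.
    entities = {"alice": "PERSON", "beijing": "CITY"}
    return [(w, entities.get(w)) for w in text.lower().split()]

def belongs_to_domain(text, weights, threshold):
    # Steps 1-2: NER, then replace common features per the preset rules.
    tokens = [REPLACE_RULES.get(cat, word) for word, cat in ner(text)]
    # Steps 3-4: extract features (here, the tokens) and sum their weights.
    value = sum(weights.get(tok, 0.0) for tok in tokens)
    # Step 5: the text belongs to this classifier's domain iff value > threshold.
    return value > threshold

# Hypothetical weights for a "call" sub-domain classifier:
call_weights = {"call": 0.9, "<person>": 0.5}
```

Because "alice" is replaced by the `<person>` placeholder, the same weight applies to any recognized person name, which is the point of the replacement step.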
8. An electronic device, comprising:
a receiving module, configured to receive a voice instruction;
a conversion module, configured to convert the voice instruction received by the receiving module into text;
a first domain recognition module, configured to perform domain recognition on the text obtained by the conversion module in parallel through at least two sub-domain classifiers to obtain a domain recognition result, wherein the domain recognition result is used to indicate the domain to which the text belongs; and
a processing module, configured to process the text through a dialog engine corresponding to the domain to which the text belongs, as determined by the first domain recognition module, and to determine the function to be executed by the electronic device corresponding to the text.
9. The electronic device of claim 8, further comprising:
a second domain recognition module, configured to match the text against pre-stored text and, when the match succeeds, to determine the domain corresponding to the pre-stored text as the domain recognition result of the text.
10. The electronic device of claim 9, wherein the first domain recognition module is specifically configured to:
when the second domain recognition module fails to match the text against the pre-stored text, perform domain recognition on the text in parallel through the at least two sub-domain classifiers to obtain the domain recognition result.
11. The electronic device of any one of claims 8 to 10, wherein the first domain recognition module comprises:
N sub-domain classifier groups, wherein each group has a different priority, N is a positive integer greater than or equal to 2, at least one of the N sub-domain classifier groups comprises at least two sub-domain classifiers, and each sub-domain classifier is used to determine whether the text belongs to the domain corresponding to that sub-domain classifier; and
a control module, configured to:
control the sub-domain classifiers in the highest-priority group of the N sub-domain classifier groups to perform domain recognition on the text;
if a sub-domain classifier in the highest-priority group recognizes the domain to which the text belongs, take the recognized domain as the domain recognition result; and
if no sub-domain classifier in the highest-priority group recognizes the domain to which the text belongs, perform domain recognition on the text through the sub-domain classifiers in the next-priority group of the N sub-domain classifier groups, until:
the domain to which the text belongs is recognized, and the recognized domain is taken as the domain recognition result; or
the text has undergone domain recognition through all sub-domain classifiers in the N sub-domain classifier groups.
12. The electronic device of claim 11, wherein, among the N sub-domain classifier groups, the domain recognition accuracy of the sub-domain classifiers in a lower-priority group is lower than that of the sub-domain classifiers in a higher-priority group.
13. The electronic device of claim 11 or 12, wherein at least one of the N sub-domain classifier groups comprises a first sub-domain classifier and a second sub-domain classifier, and
when the first sub-domain classifier performs domain recognition on the text to obtain a first domain recognition result and the second sub-domain classifier performs domain recognition on the text to obtain a second domain recognition result,
the control module is further configured to:
determine at least one of the first domain recognition result and the second domain recognition result as the domain recognition result; or
determine both the first domain recognition result and the second domain recognition result as the domain recognition result.
14. The electronic device of any one of claims 8 to 11, wherein the sub-domain classifier comprises:
a named entity recognition (NER) module, configured to perform NER on the text and determine common features in the recognized content;
a replacement module, configured to replace the common features determined by the NER module according to preset rules, wherein the preset rules comprise replacement content corresponding to common features of different categories;
an extraction module, configured to extract features from the text after replacement by the replacement module and determine a weight for each feature;
a calculation module, configured to calculate a value of the text according to the weight of each feature determined by the extraction module; and
a domain determination module, configured to determine that the text belongs to the domain corresponding to the sub-domain classifier when the value of the text is greater than a threshold.
15. An electronic device, comprising a memory, one or more processors, a plurality of applications, and one or more programs, wherein the one or more programs are stored in the memory, and wherein the one or more processors, when executing the one or more programs, cause the electronic device to implement the method of any one of claims 1 to 7.
16. A readable storage medium having instructions stored therein which, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 7.
17. A computer program product, characterized in that it comprises software code for performing the method according to any one of claims 1 to 7.
CN201880074893.0A 2018-03-05 2018-03-05 Method for voice recognition of electronic equipment and electronic equipment Active CN111373473B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/078056 WO2019169536A1 (en) 2018-03-05 2018-03-05 Method for performing voice recognition by electronic device, and electronic device

Publications (2)

Publication Number Publication Date
CN111373473A CN111373473A (en) 2020-07-03
CN111373473B true CN111373473B (en) 2023-10-20

Family

ID=67846452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880074893.0A Active CN111373473B (en) 2018-03-05 2018-03-05 Method for voice recognition of electronic equipment and electronic equipment

Country Status (2)

Country Link
CN (1) CN111373473B (en)
WO (1) WO2019169536A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897916B (en) * 2020-07-24 2024-03-19 惠州Tcl移动通信有限公司 Voice instruction recognition method, device, terminal equipment and storage medium
CN114049884B (en) * 2022-01-11 2022-05-13 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer-readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2483805A1 (en) * 2004-10-05 2006-04-05 Inago Corporation System and methods for improving accuracy of speech recognition
WO2013189342A2 (en) * 2013-01-22 2013-12-27 中兴通讯股份有限公司 Information processing method and mobile terminal
CN106992004A (en) * 2017-03-06 2017-07-28 华为技术有限公司 A kind of method and terminal for adjusting video
CN107731228A (en) * 2017-09-20 2018-02-23 百度在线网络技术(北京)有限公司 The text conversion method and device of English voice messaging

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101154379B (en) * 2006-09-27 2011-11-23 夏普株式会社 Method and device for locating keywords in voice and voice recognition system
KR101622111B1 (en) * 2009-12-11 2016-05-18 삼성전자 주식회사 Dialog system and conversational method thereof
CN103187061A (en) * 2011-12-28 2013-07-03 上海博泰悦臻电子设备制造有限公司 Speech conversational system in vehicle
CN103187058A (en) * 2011-12-28 2013-07-03 上海博泰悦臻电子设备制造有限公司 Speech conversational system in vehicle
RU2014111971A (en) * 2014-03-28 2015-10-10 Юрий Михайлович Буров METHOD AND SYSTEM OF VOICE INTERFACE
WO2017044415A1 (en) * 2015-09-07 2017-03-16 Voicebox Technologies Corporation System and method for eliciting open-ended natural language responses to questions to train natural language processors
CN105389304B (en) * 2015-10-27 2018-11-02 小米科技有限责任公司 Event Distillation method and device
CN105976818B (en) * 2016-04-26 2020-12-25 Tcl科技集团股份有限公司 Instruction recognition processing method and device
CN107741928B (en) * 2017-10-13 2021-01-26 四川长虹电器股份有限公司 Method for correcting error of text after voice recognition based on domain recognition


Also Published As

Publication number Publication date
CN111373473A (en) 2020-07-03
WO2019169536A1 (en) 2019-09-12

Similar Documents

Publication Publication Date Title
CN106897428B (en) Text classification feature extraction method and text classification method and device
US11322153B2 (en) Conversation interaction method, apparatus and computer readable storage medium
US10698604B2 (en) Typing assistance for editing
CN107102746B (en) Candidate word generation method and device and candidate word generation device
US10777192B2 (en) Method and apparatus of recognizing field of semantic parsing information, device and readable medium
CN106251869B (en) Voice processing method and device
US20230072352A1 (en) Speech Recognition Method and Apparatus, Terminal, and Storage Medium
CN107608532B (en) Association input method and device and electronic equipment
CN105931644A (en) Voice recognition method and mobile terminal
CN107688398B (en) It determines the method and apparatus of candidate input and inputs reminding method and device
WO2008100951A2 (en) Contextual input method
US11630825B2 (en) Method and system for enhanced search term suggestion
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN102298486A (en) Fast calling system and method based on touch screen
CN112562684B (en) Voice recognition method and device and electronic equipment
CN111312233A (en) Voice data identification method, device and system
CN111373473B (en) Method for voice recognition of electronic equipment and electronic equipment
US9298276B1 (en) Word prediction for numbers and symbols
CN109521888B (en) Input method, device and medium
CN113726942A (en) Intelligent telephone answering method, system, medium and electronic terminal
CN111178055A (en) Corpus identification method, apparatus, terminal device and medium
CN104347070A (en) Apparatus and method for selecting a control object by voice recognition
US10963640B2 (en) System and method for cooperative text recommendation acceptance in a user interface
CN110720104B (en) Voice information processing method and device and terminal
CN113722465B (en) Intention identification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant