CN111373473A - Electronic equipment and method for performing voice recognition by using same - Google Patents


Info

Publication number
CN111373473A
CN111373473A (application CN201880074893.0A)
Authority
CN
China
Prior art keywords
domain, text, sub, classifier, field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880074893.0A
Other languages
Chinese (zh)
Other versions
CN111373473B (en)
Inventor
隋志成
李艳明
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN111373473A publication Critical patent/CN111373473A/en
Application granted granted Critical
Publication of CN111373473B publication Critical patent/CN111373473B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems


Abstract

A method for performing speech recognition on an electronic device, and the electronic device itself, relate to the field of terminal technologies and can improve the flexibility with which a terminal recognizes voice instructions locally. The method comprises the following steps: a received voice instruction is converted into text; the text then undergoes domain recognition by at least two sub-domain classifiers to obtain a domain recognition result, which indicates the domain to which the text belongs; finally, the text is processed by the dialogue engine corresponding to that domain to determine the function the electronic device should execute. The method is suitable for speech recognition processes.

Description

Electronic equipment and method for performing voice recognition by using same

Technical Field
The present application relates to the field of terminal technologies, and in particular, to a method for performing voice recognition on an electronic device and an electronic device.
Background
With the development of terminal technology, and in particular the popularization of speech recognition, a user can now invoke a function on a terminal by inputting a voice instruction. Taking a mobile phone as an example: the user speaks a segment of voice into the phone, the phone sends it to the cloud, the cloud converts the voice into text and processes the text to obtain a processing result, and the cloud then returns the processing result to the phone so that the phone executes the matching function.
This implementation therefore depends mainly on the processing capacity of the cloud: when the terminal cannot exchange data with the cloud, it is difficult for the terminal to execute the function corresponding to an input voice instruction. To address this, a function for recognizing and processing voice instructions can be added to the terminal itself. After the terminal converts the voice into text using speech recognition, it processes the text by template matching to determine the function it needs to call, i.e., the processing result. Template matching means that the terminal compares the obtained text with existing templates and finds a template that matches the text completely; the terminal then determines the corresponding function from the stored template-to-function mapping and executes it.
This implementation, however, requires the obtained text to match a template exactly. For example, if a template specifies the structure "time + place + what to do", the terminal can match a text only when it follows that structure. If the text instead has the structure "place + time + what to do", no template matches completely, the terminal cannot determine the corresponding function, and the user cannot invoke that function by voice.
Disclosure of Invention
The embodiments of the present application provide a method for performing speech recognition on an electronic device, and an electronic device, so as to improve the flexibility with which a terminal recognizes voice instructions locally.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:
In a first aspect, an embodiment of the present application provides a method for performing speech recognition on an electronic device. The method comprises the following steps: the received voice instruction is converted into text; domain recognition is then performed on the text by at least two sub-domain classifiers to obtain a domain recognition result, which indicates the domain to which the text belongs; and the text is processed by the dialog engine corresponding to that domain to determine the function the electronic device should execute. Implementing speech recognition this way distinguishes the domains of texts effectively, so that the subsequent domain-specific text recognition is more targeted and the accuracy of speech recognition improves. Moreover, the whole process can run locally on the electronic device: even when the device cannot access the network, voice instructions can be recognized without relying on cloud processing capability, which improves the flexibility of speech recognition.
In one exemplary implementation, after the voice instruction is converted into text, the text may first be matched against pre-stored texts. When the match succeeds, the domain associated with the matching pre-stored text is taken as the domain recognition result. This pre-matching reduces the resources that subsequent domain recognition by the sub-domain classifiers would otherwise consume: it preliminarily screens the text, and if the converted text follows a common sentence pattern, its domain can be identified accurately from the existing mapping between pre-stored texts and domains, without involving any sub-domain classifier.
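The pre-matching step described above can be sketched in Python. The table contents and function name are illustrative assumptions, not taken from the patent:

```python
# Hypothetical table of pre-stored texts, each mapped to a known domain.
# In the described implementation, this lookup runs before any sub-domain
# classifier is consulted.
PRESTORED = {
    "open the eye protection mode": "settings",
    "how do you say hello in english": "translation",
}

def prematch_domain(text):
    """Return the stored domain on an exact match with a pre-stored text,
    or None so the caller falls through to the sub-domain classifiers."""
    return PRESTORED.get(text.strip().lower())
```

A normalized exact match keeps this step cheap; anything that misses the table is handed to the classifiers.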
In an exemplary implementation, performing parallel domain recognition on the text by at least two sub-domain classifiers may specifically be implemented as: when matching against the pre-stored texts fails, the at least two sub-domain classifiers perform domain recognition on the text in parallel to obtain the domain recognition result. That is, once preliminary screening shows that the text does not follow a common sentence pattern, the sub-domain classifiers take over. Running multiple sub-domain classifiers at the same time, rather than one after another, saves the time occupied by domain recognition.
In one exemplary implementation, the electronic device includes N sub-domain classifier groups, where each group has a different priority and N is a positive integer greater than or equal to 2; at least one of the N groups includes at least two sub-domain classifiers. Parallel domain recognition is then implemented as follows. The sub-domain classifiers in the highest-priority group recognize the text first. If any of them identifies the domain to which the text belongs, that domain is taken as the domain recognition result. Otherwise, the sub-domain classifiers in the next-priority group perform domain recognition on the text, and so on, until either a domain is identified and taken as the domain recognition result, or all sub-domain classifiers in the N groups have processed the text.
In this implementation, the sub-domain classifiers in the priority groups recognize the text in a fixed order, and as soon as a group yields a domain recognition result, that result is returned without passing the text to the next group. This keeps the number of sub-domain classifiers actually used low while still obtaining an accurate domain recognition result.
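The priority-group cascade just described can be sketched as follows. The classifier functions and group layout are illustrative assumptions; each classifier is assumed to return a domain name or None:

```python
from concurrent.futures import ThreadPoolExecutor

def cascade_recognize(text, groups):
    """Run the priority-group cascade: groups are ordered from highest to
    lowest priority, classifiers within one group run in parallel, and the
    cascade stops at the first group that yields any domain."""
    for group in groups:
        if not group:
            continue
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            results = [d for d in pool.map(lambda c: c(text), group) if d is not None]
        if results:
            return results
    return []  # no sub-domain classifier recognized a domain

# Illustrative classifiers: high-priority group first, fallback group last.
music = lambda t: "music" if "play" in t else None
weather = lambda t: "weather" if "rain" in t else None
chat = lambda t: "chat"  # low-priority catch-all classifier
GROUPS = [[music, weather], [chat]]
```

Stopping at the first productive group is what keeps the lower-accuracy classifiers idle for most inputs.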
In one exemplary implementation, at least two of the sub-domain classifiers in at least one of the N groups perform domain recognition on the text in parallel. That is, the sub-domain classifiers are not necessarily spread one per priority group: at least one priority group contains several of them. It should be noted that the more sub-domain classifiers perform domain recognition on the text in parallel, the more accurate the resulting domain recognition result.
In one exemplary implementation, among the N sub-domain classifier groups, the domain recognition accuracy of the sub-domain classifiers in a lower-priority group is lower than that of the sub-domain classifiers in a higher-priority group. This layer-by-layer, progressive recognition process effectively reduces the load on the less accurate sub-domain classifiers and thereby improves the accuracy of the overall domain recognition process.
In one exemplary implementation, at least one of the N sub-domain classifier groups includes a first sub-domain classifier and a second sub-domain classifier. When the first classifier obtains a first domain recognition result and the second obtains a second domain recognition result, either at least one of the two results is determined to be the domain recognition result, or both results are kept as domain recognition results. Thus, when several sub-domain classifiers in the same priority group all produce results, the final result may be selected by a preset rule or a configured summary-decision scheme, for example keeping one of the results, several of them, or all of them; the rule or decision scheme itself is not limited here.
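The summary decision above, choosing among results returned by classifiers in the same priority group, can be sketched minimally. The preference-order rule is an illustrative assumption; the patent deliberately leaves the rule open:

```python
def summarize(results, prefer=None):
    """Resolve multiple domain results from one priority group: if a
    preference order is configured, keep the first preferred domain found;
    otherwise keep all results."""
    if not results:
        return []
    if prefer:
        for p in prefer:  # scan configured preference order
            if p in results:
                return [p]
    return list(results)  # no rule configured: keep every result
```

Either branch satisfies the text: "one of the domain identification results" or "a plurality of or all of the domain identification results".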
In an exemplary implementation, the domain recognition performed on the text by each of the at least two sub-domain classifiers can be implemented as follows. Named-entity recognition (NER) is performed on the text, and common features in the recognized content are determined. The common features are then replaced according to a preset rule, which specifies the replacement content for each type of common feature. Features are extracted from the replaced text, the weight of each feature is determined, and a value for the text is computed from these weights. When the value of the text exceeds a threshold, the text is determined to belong to the domain corresponding to that sub-domain classifier. It should be noted that replacing common features in this way reduces the computing resources needed to calculate the text's value and effectively limits the influence of such features on the domain recognition process, improving the accuracy of domain recognition.
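A minimal sketch of this per-classifier scoring, under stated assumptions: a regular expression stands in for a real NER model, "<TIME>" is an assumed placeholder token, and the weights and threshold are illustrative, not the patent's values:

```python
import re

# Illustrative feature weights for a hypothetical "reminder" sub-domain
# classifier; real weights would come from the trained feature library.
WEIGHTS = {"<TIME>": 0.5, "remind": 0.6}

def score_domain(text, weights=WEIGHTS, threshold=0.8):
    """Replace a recognized common feature (here, a clock time) with a
    placeholder token, sum the weights of features present in the replaced
    text, and compare the text's value against the threshold."""
    replaced = re.sub(r"\b\d{1,2}(:\d{2})?\s*(am|pm)\b", "<TIME>", text)
    score = sum(w for feat, w in weights.items() if feat in replaced)
    return score > threshold
```

Because "3 pm" and "11:30 am" collapse to the same token, the weight table stays small, which is the resource saving the text attributes to common-feature replacement.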
In an exemplary implementation, before parallel domain recognition is performed by the at least two sub-domain classifiers, they may be trained in advance. The training process for each sub-domain classifier is as follows:
Generate positive and negative samples for the sub-domain classifier. Each sub-domain classifier may have its own independent samples, comprising a positive training sample set and a negative training sample set: samples in the positive set belong to the classifier's domain, and samples in the negative set do not.
Perform NER and rule extraction on the positive and negative samples, then perform common-feature replacement on the NER-processed samples. In an implementation of the embodiments of the present application, common features include, but are not limited to, words such as times and places, and may be preset; they may be replaced with symbols or similar placeholders, which is not limited here. Rules include, but are not limited to, sentence patterns such as "search for pictures of …". It should be noted that performing NER on the samples may be a prerequisite for rule extraction and common-feature replacement: NER identifies the places, times, sentence patterns, and so on in the samples, after which the sentence pattern serves as a rule, the time, place, etc. serve as common features, and the replacement of common features with symbols is completed.
Handle stop words. During training of the sub-domain classifier, modal particles in the positive and negative samples (e.g., "o", "ya") and punctuation symbols interfere with the recognition process, so these stop words need to be recognized and ignored during domain recognition.
Extract features to generate a training-corpus feature library, which stores the correspondence between features and weights, and compute the value corresponding to a text from those weights.
Train the sub-domain classifier, evaluate the influence of erroneous domain recognition results, and modify the positive and negative samples accordingly.
This training process dynamically adjusts the distribution of positive and negative samples, thereby improving the recognition accuracy of the sub-domain classifier.
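The feature-library step above can be sketched as follows. The stop-word list and the weighting scheme (positive-frequency ratio) are illustrative assumptions, not the patent's formula:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an"}  # illustrative stop-word list

def build_feature_library(positive, negative):
    """Tokenize the positive and negative samples, drop stop words, and
    weight each surviving feature by how often it occurs in positive
    samples relative to its total occurrences."""
    def count(samples):
        c = Counter()
        for s in samples:
            c.update(w for w in s.lower().split() if w not in STOP_WORDS)
        return c
    pos, neg = count(positive), count(negative)
    # Feature library: feature -> weight, stored for later text scoring.
    return {w: pos[w] / (pos[w] + neg[w]) for w in pos}
```

Re-running this after adjusting the samples is the dynamic redistribution the text describes.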
In a second aspect, an embodiment of the present application provides an electronic device. The electronic device may implement the functions implemented in the above method embodiments, and the functions may be implemented by hardware or by hardware executing corresponding software. The hardware or software comprises one or more modules corresponding to the functions.
In a third aspect, an embodiment of the present application provides an electronic device. The electronic device comprises a memory and one or more processors. Wherein the memory is configured to store computer program code comprising computer instructions. The one or more processors mentioned above, when reading and executing the computer instructions, cause the electronic device to implement the method of any of the first aspect and its various exemplary implementations.
In a fourth aspect, embodiments of the present application provide a readable storage medium including instructions. The instructions, when executed on an electronic device, cause the electronic device to perform the method of any of the first aspect and its various exemplary implementations described above.
In a fifth aspect, embodiments of the present application provide a computer program product, which includes software code for performing the method of any one of the first aspect and its various exemplary implementations.
Drawings
Fig. 1 is a schematic structural diagram of a terminal according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of an exemplary method provided by an embodiment of the present application;
fig. 3 is a flowchart of an exemplary method for processing a voice command by a mobile phone according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an exemplary domain identification multi-classification system provided in an embodiment of the present application;
fig. 5 is a schematic flow chart illustrating an implementation process of text field recognition by using the system shown in fig. 4 according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for training a sub-domain classifier under the condition that a text belongs to the domain according to an embodiment of the present application;
fig. 7 is a flowchart of a training method for adjusting positive and negative samples of a sub-domain classifier according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of another electronic device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application can be applied to an electronic device, which may be a terminal such as a notebook computer, a smartphone, a Virtual Reality (VR) device, an Augmented Reality (AR) device, a vehicle-mounted device, or a smart wearable device. The terminal may be provided with at least a display screen, an input device, and a processor. Taking the terminal 100 as an example, as shown in fig. 1, the terminal 100 includes components such as a processor 101, a memory 102, a camera 103, an RF circuit 104, an audio circuit 105, a speaker 106, a microphone 107, an input device 108, other input devices 109, a display screen 110, a touch panel 111, a display panel 112, an output device 113, and a power supply 114. The display screen 110 comprises at least the touch panel 111, serving as an input device, and the display panel 112, serving as an output device. It should be noted that the terminal structure shown in fig. 1 does not constitute a limitation on the terminal; the terminal may include more or fewer components than shown, combine or split some components, or arrange components differently, which is not limited here.
The various components of the terminal 100 will now be described in detail with reference to fig. 1:
a Radio Frequency (RF) circuit 104 may be configured to receive and transmit signals during information transmission and reception or during a call, for example, if the terminal 100 is a mobile phone, the terminal 100 may receive downlink information transmitted by a base station through the RF circuit 104 and then transmit the downlink information to the processor 101 for processing; in addition, data relating to uplink is transmitted to the base station. Typically, the RF circuitry includes, but is not limited to, an antenna, at least one Amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 104 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), etc.
The memory 102 may be used to store software programs and modules, and the processor 101 executes various functional applications and data processing of the terminal 100 by operating the software programs and modules stored in the memory 102. The memory 102 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (e.g., a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (e.g., audio data, video data, etc.) created according to the use of the terminal 100, and the like. Further, the memory 102 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
Other input devices 109 may be used to receive input numeric or character information and generate key signal inputs relating to user settings and function control of terminal 100. In particular, other input devices 109 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, a light mouse (a light mouse is a touch-sensitive surface that does not display visual output, or is an extension of a touch-sensitive surface formed by a touch screen), and the like. Other input devices 109 may also include sensors built into terminal 100, such as gravity sensors, acceleration sensors, etc., and terminal 100 may also use parameters detected by the sensors as input data.
The display screen 110 may be used to display information input by or provided to the user and the various menus of the terminal 100, and may also accept user input. The display panel 112 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel 111, also referred to as a touch screen or touch-sensitive screen, may collect contact or non-contact operations on or near it (for example, operations performed by the user with a finger, a stylus, or any other suitable object or accessory, as well as motion-sensing operations, including single-point and multi-point control operations) and drive the corresponding connection device according to a preset program. It should be noted that the touch panel 111 may further comprise two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position and gesture, detects the signals brought by the touch operation, and transmits the signals to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into information the processor 101 can process, transmits it to the processor 101, and also receives and executes commands sent by the processor 101. In addition, the touch panel 111 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave, or in any technology developed in the future.
In general, the touch panel 111 may cover the display panel 112, a user may operate on or near the touch panel 111 covered on the display panel 112 according to the content displayed on the display panel 112 (the display content includes, but is not limited to, a soft keyboard, a virtual mouse, virtual keys, icons, etc.), the touch panel 111 detects the operation on or near the touch panel 111, and transmits the operation to the processor 101 to determine a user input, and then the processor 101 provides a corresponding visual output on the display panel 112 according to the user input. Although in fig. 1, the touch panel 111 and the display panel 112 are two separate components to implement the input and output functions of the terminal 100, in some embodiments, the touch panel 111 and the display panel 112 may be integrated to implement the input and output functions of the terminal 100.
The audio circuit 105, speaker 106, and microphone 107 may provide an audio interface between the user and the terminal 100. The audio circuit 105 may transmit the signal converted from received audio data to the speaker 106, which converts it into a sound signal and outputs it; conversely, the microphone 107 converts collected sound signals into electrical signals, which the audio circuit 105 receives and converts into audio data. The audio data may then be output to the RF circuit 104 for transmission to another terminal or other device, or output to the memory 102 for further processing by the processor 101 in conjunction with the content stored there. In addition, the camera 103 may capture image frames in real time and transmit them to the processor 101 for processing, with the results stored in the memory 102 and/or presented to the user through the display panel 112.
The processor 101 is a control center of the terminal 100, connects various parts of the entire terminal 100 using various interfaces and lines, performs various functions of the terminal 100 and processes data by running or executing software programs and/or modules stored in the memory 102 and calling data stored in the memory 102, thereby monitoring the terminal 100 as a whole. It is noted that processor 101 may include one or more processing units; the processor 101 may also integrate an application processor, which mainly handles operating systems, User Interfaces (UIs), application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 101.
The terminal 100 may further include a power supply 114 (e.g., a battery) for supplying power to various components, and in this embodiment, the power supply 114 may be logically connected to the processor 101 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
In addition, there are also components not shown in fig. 1, for example, the terminal 100 may further include a bluetooth module and the like, which are not described herein.
The following explains an embodiment of the present application by taking the terminal 100 as a mobile phone as an example.
At present, after a mobile phone sends a received voice instruction to the cloud, the cloud converts the instruction into text through speech recognition and then processes the text to determine the function, i.e., the processing result, that the phone should execute. The cloud may process the text by matching it against the contents of templates one by one to obtain a processing result, or by extracting keywords from the text and deriving the processing result from them. The cloud then returns the processing result to the phone, and the phone implements the corresponding function.
Thus both the conversion of the voice instruction into text and the subsequent text processing occur in the cloud; the phone only needs to send the received voice instruction to the cloud, receive the processing result once the cloud finishes processing, and execute the corresponding function.
In this implementation, data transmission between the phone and the cloud requires a network, so when the phone cannot connect to a network, it cannot be guaranteed to execute the function corresponding to a voice instruction accurately and effectively.
In addition, regardless of whether the phone must be networked to process voice instructions, both cloud-side and on-device template matching share drawbacks. Most templates are produced manually, so generating more templates usually consumes considerable manpower and material resources; and a template is fixed once generated, so when the structure of a voice instruction cannot be matched completely with the structure of a template, the failure rate of processing rises. In other words, template-based text processing is inflexible, and matching a text against many templates also takes a long time. Processing text by extracting keywords suffers from similar problems.
During template matching, when the content of the text involves multiple domains, ambiguity arises easily, that is, the recognition rate is low. For example, for the text "how do you say it in English", both translation and language settings are involved, and the cloud recognizes the text as "set language"; for the text "translate: turn on eye-protection mode right away", both translation and mode activation are involved, and the cloud recognizes it as "turn on eye-protection mode"; for the text "help me remember the geographic position of this restaurant", both location information and note taking are involved, and the cloud recognizes it as "Global Positioning System (GPS)"; for the text "large font in Weibo", both fonts and font adjustment are involved, and the cloud recognizes it as "font"; for the text "remind me to turn on flight mode tomorrow afternoon", both time and mode activation are involved, and the cloud recognizes it as "turn on flight mode". Therefore, when a text involves multiple domains, the cloud or the mobile phone can hardly determine the function corresponding to the text accurately.
Here, a domain refers to a type of text. The types may be divided according to the linguistic context of the text. In the embodiments of this application, domains serve as text types; during text processing, one domain corresponds to one type of task, that is, texts belonging to the same domain are handled as the same type of task by the same dialog engine. The dialog engine may process the text by analyzing it with Natural Language Processing (NLP) technology and outputting a processing result. The processing result may include the code of the function the mobile phone needs to implement, so that the phone invokes the corresponding function. In a specific implementation, the dialog engine analyzes the text in its domain, determines the function to be executed that corresponds to the text, and may generate the instruction code for that function; the instruction code is code that the machine can recognize in order to perform the corresponding function.
The code may be in binary form, or may be high-level code such as < play > < wangfei > < song > (the instruction code generated for the voice instruction "I want to listen to Faye Wong's music" input by the user), which is not limited in this application.
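As an illustration only, the mapping from a parsed action and its slots to a tagged high-level instruction code like the one above can be sketched as follows. The function name, slot list, and tag format are assumptions for illustration, not the patent's actual encoding.

```python
# Hypothetical sketch: assembling a high-level instruction code such as
# "<play><wangfei><song>" from an action name and its slot values.
# The tagged-token format is an illustrative assumption.

def to_instruction_code(action: str, slots: list) -> str:
    """Wrap the action and each slot value in angle-bracket tags."""
    parts = [action] + slots
    return "".join(f"<{p}>" for p in parts)

# "I want to listen to Faye Wong's music" -> action "play",
# slots: artist "wangfei", media type "song"
code = to_instruction_code("play", ["wangfei", "song"])
print(code)  # <play><wangfei><song>
```

The phone would only need to recognize such a code and dispatch it to the corresponding function, which is why the patent leaves the concrete encoding open.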
To solve the above problems, an embodiment of the present application provides a speech recognition method. Fig. 2 is a schematic flow chart of an exemplary method provided in the embodiment of the present application. After the mobile phone receives a voice instruction through the user voice portal, it converts the instruction into text using speech recognition technology. The terminal then performs domain recognition on the text, the domain recognition result is handed to the dialog engine corresponding to that result for processing, and finally the obtained processing result is fed back to the mobile phone.
It should be noted that when the user inputs the voice instruction through a portal provided by a third-party application or a system application, the dialog engine may feed the processing result back to the application providing the portal, so that this application can implement functions such as interface switching. When the user inputs the voice instruction through a portal provided by a system-level display interface of the mobile phone, the phone may at that moment be showing the main interface or a system-level display interface such as the settings interface, rather than the running interface of an application; the dialog engine therefore feeds the processing result back to the phone's system, which can then implement functions such as launching an application or adjusting the font size of the display interface. For example, suppose the interface presented to the user is the phone's main interface, and the user launches a game application by inputting the voice instruction "open the game application". After the phone completes speech recognition and subsequent processing of the instruction, the processing result is fed back to the phone's system, which starts the game application.
System applications include, but are not limited to, applications with a voice-instruction-receiving function that are pre-installed when the mobile phone leaves the factory. Third-party applications include, but are not limited to, applications with a voice-instruction-receiving function that the user downloads and installs from a platform such as an application store, and applications that implement a calling function through other applications on the phone. In the embodiments of this application, a system-level display interface refers to an interface of the mobile phone other than an application's running interface, for example, the phone's main interface or settings interface; an application's running interface includes, but is not limited to, an interface presented to the user through the phone during or after the application is started, for example, the application's loading interface or settings interface.
In the above domain recognition process, the domains include, but are not limited to, setting, do-not-disturb (nondisturb), gallery, translation (translate), stock, weather, calculation (calculator), and encyclopedia (Baike).
In the embodiments of the present application, the text obtained by speech recognition may be classified into a predetermined category through recognition of keywords in the text, through template matching, or through the processing of sub-domain classifiers; the predetermined categories include, but are not limited to, the exemplary domains listed above. For the implementation of keyword recognition and template matching, reference may be made to existing domain recognition for text, which is not described again here. The sub-domain classifiers may be disposed in the multi-classification system for domain recognition shown in fig. 2; their functions are described later and are not detailed here.
The multi-classification system for domain recognition shown in fig. 2 performs domain recognition on the text that the mobile phone has finished converting and outputs the corresponding domain recognition result. According to this result, the phone hands the text to the dialog engine corresponding to the domain recognition result for processing and obtains a processing result, so that the phone invokes the corresponding function as indicated by the processing result.
The user voice portal may be a general portal such as a voice assistant, or a local portal of a system application or third-party application on the mobile phone. Taking a system application as an example, the user inputs voice in the gallery so that the phone completes a picture-search function within the gallery.
Fig. 3 is a flowchart illustrating an exemplary method for processing a voice instruction on a mobile phone according to an embodiment of the present application. Take the example of a user turning on the phone's Wi-Fi function by voice: the user inputs the voice instruction "open Wi-Fi", and speech recognition yields a text whose content is "open Wi-Fi". The phone performs domain recognition on this text, confirms that the text belongs to the setting domain, and sends the text to the dialog engine corresponding to that domain, that is, the phone sends the domain recognition result to the local multi-domain semantic understanding dialog engine corresponding to the setting domain. After the dialog engine processes it, the phone executes the corresponding function according to the processing result. In the embodiments of this application, the dialog engine may further prompt the user, by playing voice or popping up a dialog box, that the phone has finished executing the function corresponding to the user's voice instruction. For example, in the case shown in fig. 3, the phone may play "Wi-Fi turned on" aloud, or pop up text such as "Wi-Fi turned on" to prompt the user.
The domain recognition method in the embodiments of the present application differs from the way a mobile phone performs domain recognition locally in the prior art. In the prior art, processing of a voice instruction mainly relies on template matching, so when the text structure differs from the template structure, the phone cannot obtain an accurate processing result. The embodiments of this application introduce domain recognition and dialog engines: the domain recognition process not only considers template matching and keyword extraction, but can also use multiple parallel sub-domain classifiers working together to screen, from many domains, the one or more domains that correspond to the text, and hand the text to the dialog engines of the screened domains for processing. Therefore, even when the text structure cannot fully match a template, the phone can still analyze and process the text further. Notably, for a text involving multiple domains, the phone can process it from the perspective of each of those domains instead of pushing it to the dialog engine of only one domain. With the speech recognition process provided in the embodiments of this application, the domains of a text can be effectively distinguished, and the domain-based text recognition process can be completed in a more targeted way, so that the function to be executed by the electronic device is determined and the accuracy of speech recognition is improved. Moreover, this process can be performed locally on the electronic device: even when the device cannot access the network, voice instructions can be recognized without relying on cloud processing capability, which improves the flexibility of speech recognition.
It should be noted that the example shown in fig. 3 is an interaction of the phone's dialog system: the user inputs a voice instruction, the phone processes it and executes the corresponding function, and the result of executing the function is fed back to the user by voice playback or display. The voice instruction input by the user and the voice or display output of the phone together form the dialog-system interaction. That is, the phone's spoken or displayed result is an exemplary way for the phone to respond to the user's voice instruction during or after executing the corresponding function.
Fig. 4 is a diagram illustrating an exemplary multi-classification system for domain recognition according to an embodiment of the present application. The system is intended to complete domain recognition of a given text. For the domain recognition process, the system can be divided into three layers: a control layer, a classifier layer, and an algorithm layer.
The functions, actions, and the like of each layer involved in the system will be described below.
In one illustrative example, the control layer includes the following components: fast full-precision text matching, domain scheduling, classification decision, and a data loader.
Fast full-precision text matching means that for commonly used phrases and sentence patterns, such as frequent and unambiguous fixed expressions, the control layer can directly assign the text's domain without further processing by the classifier layer. In the embodiments of this application, the templates for fast full-precision matching may be preset; for the specific setting method, reference may be made to existing manual templates, for example the templates involved in the template-matching method described in the background, which is not repeated here.
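Purely as an illustration, fast full-precision matching can be modeled as an exact lookup against a preset table of fixed utterances. The table contents, function name, and normalization below are invented for the sketch.

```python
# Illustrative sketch of fast full-precision matching: a preset table maps
# fixed, frequently used utterances directly to a domain, so the classifier
# layer is bypassed entirely. Table entries are invented examples.

EXACT_MATCH_TABLE = {
    "open wi-fi": "setting",
    "what's the weather tomorrow": "weather",
}

def fast_full_precision_match(text: str):
    """Return the domain of an exactly matching utterance, else None."""
    return EXACT_MATCH_TABLE.get(text.strip().lower())

print(fast_full_precision_match("Open Wi-Fi"))     # setting
print(fast_full_precision_match("translate cat"))  # None -> classifier layer
```

A `None` result corresponds to the case described below, where the text falls through to the classifier layer for further processing.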
The function of domain scheduling includes scheduling the sub-domain classifiers in each priority tier of the classifier layer. For example, if the domain of the text is not successfully determined after fast full-precision matching, domain scheduling may schedule the sub-domain classifiers in priority1 to process the text and, if they cannot determine the domain, continue with the sub-domain classifiers in priority2, and so on, until the text's domain is determined or all sub-domain classifiers of the classifier layer have processed the text without determining a domain. In addition, for a single sub-domain classifier, domain scheduling can also be used to invoke the algorithms, rules, patterns, and so on that the classifier needs.
The algorithms, rules, patterns, and so on are all used while a sub-domain classifier processes the text. In the embodiments of this application, when the text matches, or satisfies, a rule, the classifier layer returns a domain recognition result; when the text matches, or satisfies, a pattern, it may be determined that the text has a greater probability of belonging to the domain corresponding to that pattern. In other words, a rule matched by the text plays a decisive role in determining the text's domain, while a matched pattern increases the confidence of that determination. Specific implementations are described in the examples given later and are not repeated here.
That is, domain scheduling links the control layer and the classifier layer. After fast full-precision matching of the text yields no result, scheduling of the sub-domain classifiers in each priority tier is carried out in order from the highest priority to the lowest, and while the sub-domain classifiers process the text, the corresponding algorithms, rules, patterns, and so on are scheduled according to each classifier's needs.
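The scheduling sequence just described can be sketched as a loop over the priority tiers. Classifiers are modeled here as plain callables returning a domain name or `None`; all names and the early-return policy are assumptions for illustration.

```python
# Sketch of the domain-scheduling loop: try fast full-precision matching
# first, then walk the classifier tiers from highest to lowest priority,
# stopping as soon as any tier yields a valid domain recognition result.

def schedule(text, exact_match, priority_tiers):
    """Return the recognized domain(s) as a list, or None if no tier succeeds."""
    domain = exact_match(text)
    if domain is not None:
        return [domain]                    # resolved at the control layer
    for tier in priority_tiers:            # priority1, priority2, ...
        results = {c(text) for c in tier}  # every sub-domain classifier runs
        results.discard(None)
        if results:                        # tier produced a valid result
            return sorted(results)         # may contain several domains
    return None                            # no tier recognized a domain

tiers = [
    [lambda t: "setting" if "wi-fi" in t else None,
     lambda t: "weather" if "weather" in t else None],
    [lambda t: "translate" if "english" in t else None],
]
print(schedule("open wi-fi", lambda t: None, tiers))  # ['setting']
```

Because the loop returns at the first tier with a valid result, a lower-priority tier is only consulted when every higher tier has failed, matching the serial relationship between tiers described below.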
The classification decision, that is, the summary decision, mainly determines the text's domain, or determines that the text has no recognizable domain, according to the processing results of each priority tier of the classifier layer, in the case where the control layer's fast full-precision matching has not determined the domain.
For example, after the sub-domain classifiers in priority1 process a text, it may be determined that the text belongs to both domain 1 and domain 2. The classification decision is then used to decide how to determine the text's domain when the recognition results obtained from all sub-domain classifiers in the same priority tier include multiple domains. In the embodiments of this application, the classification decision may assign the text to one of the domains, or to multiple domains at once; that is, it may determine that the text belongs to domain 1, to domain 2, or to both domain 1 and domain 2.
As another example, suppose the sub-domain classifiers in priority1 do not determine the text's domain, while after processing by the sub-domain classifiers in priority2 the text is determined to belong to domain 1. The classification decision then determines the text's domain by summarizing the recognition results from priority1 and priority2: no domain exists in priority1, and domain 1 exists in priority2, so the final domain recognition result is that the text belongs to domain 1.
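The two examples above can be captured in a small aggregation function. The per-tier result format and the keep-all-claims policy are assumptions; the patent leaves the decision policy open (keep one domain or several).

```python
# Sketch of the classification (summary) decision: given the domain claims
# from each priority tier, ordered from priority1 downward, the first tier
# with any claim decides the result. Keeping every claimed domain of that
# tier (deduplicated) is one possible policy; picking a single domain is
# another.

def classification_decision(tier_results):
    """tier_results: one list of claimed domains (or None) per tier."""
    for claims in tier_results:
        claims = [d for d in claims if d]   # drop classifiers that claimed nothing
        if claims:
            return sorted(set(claims))      # the text may belong to several domains
    return None                             # no tier claimed a domain

# priority1 found nothing; priority2 claims domain1:
print(classification_decision([[None, None], ["domain1", None]]))   # ['domain1']
# priority1 claims both domain1 and domain2:
print(classification_decision([["domain1", "domain2"]]))  # ['domain1', 'domain2']
```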
In the embodiments of this application, the mobile phone may generate an instance during recognition of a voice instruction. An instance is a task to be processed, namely the phone performing domain recognition on the text converted from the voice instruction. Within the same priority tier of the classifier layer, multiple sub-domain classifiers can process the same instance at the same time; that is, the phone executes multiple tasks simultaneously to carry out domain recognition of the text.
The data loader is used to obtain, from the phone itself, the network side, or third-party equipment such as a server, the data of the various libraries required by the algorithm layer, the models of the sub-domain classifiers in the classifier layer, and configuration information. A sub-domain classifier is the classifier corresponding to a given domain; configuration information includes, but is not limited to, initialization parameters of the respective models.
In addition, the control layer serves as the layer through which the system interacts with other components of the mobile phone: it can obtain from the phone the text produced by speech recognition and, after the system has processed the text, feed the domain recognition result, that is, the classification result, back to the phone.
The control layer is therefore responsible for the external service interaction interface, loading of initialization data and models, scheduling of domain classification tasks, distribution of classification tasks to the sub-domain classifiers, and summarizing and deciding on all returned classification results.
In an illustrative example, the classifier layer includes a plurality of priority tiers, such as priority 1 (priority1), priority 2 (priority2), and priority 3 (priority3), where priority 1 is higher than priority 2, which is higher than priority 3. Each tier may contain one or more classifier instances, that is, sub-domain classifiers, such as classifier instance 11, classifier instance 12, and classifier instance 13 in priority 1.
The classifier layer is used to classify the text. In the actual classification process, the classifier layer supports multi-tier, multi-instance task classification: as described above, it includes classifier groups of several priorities, and within each priority group there are multiple parallel sub-domain classifiers that can execute simultaneously, so that the domain classification of the text can culminate in a summary decision.
A single sub-domain classifier includes rules, patterns, Named Entity Recognition (NER), and a prediction component, thereby realizing extraction of sub-domain features and domain recognition. It should be noted that in different sub-domain classifiers of the same priority tier, the same text may yield the same sub-domain features yet different domain recognition results.
Sub-domain features include, but are not limited to, keywords in the text; in different domains the same keyword may carry the same or different meanings, and in the embodiments of this application keywords may influence the domain recognition result. A domain recognition result is the outcome of a sub-domain classifier's processing of the text, a preliminary prediction of the domain the text may belong to. For example, after two sub-domain classifiers of the same priority process the same text, one may determine that the text belongs to domain 1 and the other that it belongs to domain 2; the two classifiers thus obtain different domain recognition results.
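The interplay of decisive rules, probability-raising patterns, and a prediction step inside one sub-domain classifier can be sketched as follows. The regular expressions, the score threshold, and the omission of the NER component are simplifying assumptions for illustration.

```python
# Minimal sketch of a single sub-domain classifier combining the ingredients
# described in the text: rules (a match is decisive), patterns (each match
# raises the probability), and a prediction step (accept above a threshold).

import re

class SubDomainClassifier:
    def __init__(self, domain, rules, patterns):
        self.domain = domain
        self.rules = [re.compile(r) for r in rules]        # decisive
        self.patterns = [re.compile(p) for p in patterns]  # probabilistic

    def classify(self, text):
        """Return the domain if recognized, else None."""
        if any(r.search(text) for r in self.rules):
            return self.domain            # a rule match decides immediately
        score = sum(1 for p in self.patterns if p.search(text))
        # prediction step: accept only if enough patterns matched
        return self.domain if score >= 2 else None

clf = SubDomainClassifier(
    "translate",
    rules=[r"^translate\b"],
    patterns=[r"\bin english\b", r"\bhow (do you|to) say\b"],
)
print(clf.classify("translate good morning"))         # translate
print(clf.classify("how do you say cat in english"))  # translate
print(clf.classify("open wi-fi"))                     # None
```

Running several such classifiers with different rule and pattern sets over the same text is exactly how the same text can yield different domain recognition results within one priority tier.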
In addition, classifier groups of two adjacent priorities have a serial relationship. For the higher tier priority1 in the classifier layer, if a sub-domain classifier in priority1 obtains a valid domain recognition result, that result can be fed back to the mobile phone through the control layer; if none of the sub-domain classifiers in priority1 obtains a valid result, the text is passed to the sub-domain classifiers in the serial tier priority2 for processing, and so on until a valid domain recognition result is obtained. If no domain recognition result is obtained after the text has traversed every priority tier in the classifier layer, a classification result indicating that no domain was recognized can be fed back to the phone. A valid domain recognition result means that the text's domain can be determined within a certain priority tier; the determined domain is the valid result.
It should be noted that the embodiments of the present application do not limit the number of priority tiers in the classifier layer or the number of sub-domain classifiers within one tier. The domain corresponding to each sub-domain classifier may be predefined, and during subsequent use of the system this correspondence may be adjusted, including, but not limited to, adjusting a sub-domain classifier's priority, adjusting the domain it corresponds to, and increasing or decreasing the number of sub-domain classifiers. For example, one sub-domain classifier may be moved from one priority tier to another, or sub-domain classifiers in different tiers may be exchanged.
In the actual configuration process, domains with higher domain recognition accuracy can be assigned to high-priority sub-domain classifiers, and better-performing models can likewise be assigned to high-priority classifiers. Because the high-priority sub-domain classifiers have high recognition accuracy and well-performing models, text in the domains to be recognized can be recognized with high priority and high timeliness. When the text's domain is identified in the highest-priority sub-domain classifiers, the system can return the recognition result to the mobile phone directly; otherwise, if the high-priority sub-domain classifiers obtain no valid result, the text is handed to the next tier of sub-domain classifiers for recognition, until a valid result is obtained or every tier has processed the text. A valid domain recognition result means that the system has determined the domain corresponding to the text; the highest-priority sub-domain classifiers are, for example, those in priority1 of fig. 4, namely sub-domain classifier 11, sub-domain classifier 12, and sub-domain classifier 13. It should be noted that at the classifier layer the text may be recognized in order, from the group of highest priority to the lowest; of course, once the text's domain is identified in the group of a certain tier, the domain recognition process for that text can end.
In an exemplary embodiment, the algorithm layer provides the algorithms and models. The models refer to databases such as a rule library, a Named Entity (NE) library, and a feature library. The algorithms provided by the algorithm layer may also take the form of a database, such as an algorithm model library containing multiple algorithms.
It should be noted that before these algorithms are invoked, the data loader of the control layer needs to load the contents related to them into the system, so that the sub-domain classifiers of the classifier layer can invoke them flexibly.
Fig. 5 is a schematic flow chart illustrating an implementation process of text domain recognition using the system shown in fig. 4.
After the mobile phone inputs the text into the system shown in fig. 4, the system first performs fast full-precision matching on the text at the control layer. If a domain can be determined for the text in this way, that domain is taken directly as the recognition result; if not, the text continues to be processed further by the classifier layer.
The further processing of the text is carried out by performing domain recognition in order, from the highest-priority group of sub-domain classifiers in the classifier layer to the lowest. No matter in which priority group the text's domain is recognized, as soon as the corresponding domain is identified it is fed back to the phone as the domain recognition result, and the text is not processed by any subsequent priority group.
As shown in fig. 5, the system first performs classification task scheduling for priority1, that is, it calls sub-domain classifier 11, sub-domain classifier 12, and sub-domain classifier 13 to perform parallel domain recognition on the input text. Parallel domain recognition means that the three classifiers recognize the text's domain at the same time, or in a certain time sequence; each then outputs a domain recognition result, and the classification decision corresponding to priority1 judges the three outputs to determine either the domain recognition result to feed back to the phone or that the text should be input into the next priority group.
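The simultaneous variant of parallel domain recognition can be sketched with a thread pool: every classifier of one tier processes the same text in the same time period, and the tier's decision then inspects the collected results. The toy classifiers and the thread-pool choice are illustrative assumptions.

```python
# Sketch of running the sub-domain classifiers of one priority group in
# parallel: all classifiers of the tier process the same text concurrently,
# and the non-None results are collected for the group's classification
# decision.

from concurrent.futures import ThreadPoolExecutor

def classify_in_parallel(text, classifiers):
    """Run every classifier of one priority tier concurrently."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        results = list(pool.map(lambda c: c(text), classifiers))
    return [d for d in results if d is not None]

tier1 = [
    lambda t: "setting" if "wi-fi" in t else None,
    lambda t: "gallery" if "photo" in t else None,
    lambda t: "weather" if "weather" in t else None,
]
print(classify_in_parallel("open wi-fi", tier1))  # ['setting']
```

Replacing the pool with a plain loop gives the sequential variant, which, as the text notes below, trades latency for a lower momentary resource footprint.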
If the text is input into the next priority group, the system continues with classification task scheduling for priority2, calling sub-domain classifier 21, sub-domain classifier 22, and sub-domain classifier 23 to perform parallel domain recognition on the text, implemented as described in the previous paragraph. Similarly, after the system completes domain recognition for priority2, it may either input the text into the sub-domain classifiers of priority3 or directly feed the obtained valid domain recognition result back to the phone. Here, the final recognition result may be the text's domain obtained by the control layer's full-precision matching; or the text's domain obtained through one or more priority groups of the classifier layer; or the result that no domain was obtained after the text passed through the control layer and every priority tier of the classifier layer.
In the embodiments of this application, the system's domain recognition of a text ends either when a valid domain recognition result is obtained, or when no valid result can be obtained after every priority group in the classifier layer has performed domain recognition on the text.
It should be noted that when the sub-domain classifiers in one priority group recognize the text's domain simultaneously, the determination is carried out by multiple classifiers within the same time period, which effectively saves the time the determination occupies. When they recognize the text in a certain time sequence instead, only one sub-domain classifier runs during any period, which ensures that the system occupies fewer resources for that single classifier and that the phone retains enough resources for other systems or programs to use.
From the system shown in fig. 4 and the method flow shown in fig. 5, it can be seen that, compared with prior-art schemes in which the cloud implements speech recognition, the system provided in the embodiments of this application is more extensible, more flexible, more accurate, and more refined.
High extensibility means the system can support arbitrary future additions of new vertical categories without rebuilding the existing model, that is, the system provided in the embodiments of this application. Vertical categories are the categories of the different domains involved in the embodiments, such as setting, do-not-disturb, gallery, translation, stock, weather, calculation, and encyclopedia. In subsequent use, sub-domain classifiers corresponding to other domains can be added to the classifier layer according to the requirements of different application scenarios.
Flexibility means that the different priority groups can be adjusted flexibly according to the characteristics of current and future vertical categories, for example by increasing or decreasing the sub-domain classifiers within a single priority group, or exchanging sub-domain classifiers among multiple priority groups, which is not limited here. This ensures that the classifier layer can reach a relatively accurate domain recognition result after the summary decision.
Higher accuracy means that a single sub-domain classifier can process the text with an analysis and calculation method tailored to the characteristics of its corresponding domain, for example, the handling of numbers and stop words (stopwords), the choice between the binary method (bi-gram) and the ternary method (tri-gram), and the feature extraction range and mode. Because the same or different processing modes can be adopted across different sub-domain classifiers, each classifier can be more targeted, and the accuracy is therefore relatively high.
More refined means that, through the screening of training data and more targeted training and optimization of each sub-domain classifier, the process by which a sub-domain classifier recognizes the domain of the text becomes more refined, achieving more accurate domain recognition.
The process of text domain identification by the above system is set forth below with reference to an illustrative example.
In this embodiment of the present application, the domains corresponding to the respective sub-domain classifiers in the classifier layer may be configured in advance. For example, in descending order of domain identification accuracy, the sub-domain classifiers with higher accuracy may be placed in the higher-priority groups, such as priority1, and the sub-domain classifiers with lower accuracy in the lower-priority groups, such as priority3. In an exemplary implementation, the mobile phone may put its own vertical classification tasks, which have the highest classification accuracy, into priority1, put the vertical classification tasks docked with applications into priority2, and put the vertical classification tasks that are most difficult to identify into priority3.
For the tasks docked with applications, after the sub-domain classifiers in the priority group process them, the resulting domain division has little or no influence on the processing of the voice instruction, no matter which sub-domain classifier in the group produced it. Since the sub-domain classifiers set in priority2 correspond to applications, the text processing for different domains under the same application is generally handed to the dialog engine corresponding to that application. That is, whichever of the application's domains the text belongs to, it is ultimately processed by the same dialog engine. Therefore, in an implementation of the embodiment of the present application, the vertical classification tasks docked with applications may be regarded as tasks with a low requirement on domain recognition accuracy: whichever domain the final recognition result names, as long as a valid domain recognition result is produced in priority2, the text is handed to the same dialog engine, and the processing result is not affected.
For example, suppose priority2 includes a sub-domain classifier 21, a sub-domain classifier 22, and a sub-domain classifier 23, where the sub-domain classifier 21 corresponds to the stock domain, the sub-domain classifier 22 to the translation domain, and the sub-domain classifier 23 to the calculation domain, and stock, translation, and calculation all correspond to the same application, that is, the same dialog engine. Then, in priority2, the handset will eventually push the text to the same dialog engine regardless of which of the stock, translation, and calculation domains the text is determined to belong to. Therefore, whichever of the domains involved in priority2 its domain identification result names, the result does not influence the subsequent text processing by the dialog engine.
It should be noted that the domains corresponding to the respective sub-domain classifiers in priority2 may correspond to two or three dialog engines; the point is that multiple sub-domain classifiers' domains may correspond to the same dialog engine.
Each vertical classification task is executed by the sub-domain classifier corresponding to its domain. The sub-domain classifiers corresponding to the phone's own vertical classification tasks may include, but are not limited to, those for the phone's built-in functions, such as the sub-domain classifiers for the setting, do-not-disturb, and gallery domains. The sub-domain classifiers corresponding to third-party-docked vertical classification tasks may include, but are not limited to, those for functions realized by applications installed on the mobile phone, or by programs such as applets that can be called directly without downloading or installation, for example, the sub-domain classifiers for the stock, translation, calculation, and weather domains. The sub-domain classifiers corresponding to the vertical tasks that are most difficult to identify may include, but are not limited to, those for domains whose recognition result is hard to determine from keywords, for example, a domain with a search function such as encyclopedia.
It can be seen that, in an exemplary implementation, the distribution of each sub-domain classifier in the classifier layer is as follows:
priority 1: sub-domain classifiers corresponding to the setting, do-not-disturb, and gallery domains respectively;
priority 2: sub-domain classifiers corresponding to the stock, translation, calculation, and weather domains respectively;
priority 3: a sub-domain classifier corresponding to the encyclopedia domain.
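Under the assumptions above, the classifier-layer configuration could be sketched as a simple static table (the identifiers are illustrative, not the patent's actual names):

```python
# Hypothetical static configuration mirroring the distribution above:
# priority1 holds the phone's own verticals (highest accuracy), priority2
# the application-docked verticals, priority3 the hardest vertical.
CLASSIFIER_LAYER = {
    "priority1": ["setting", "nodisturb", "gallery"],
    "priority2": ["stock", "translation", "calculation", "weather"],
    "priority3": ["encyclopedia"],
}

# Groups are consulted from high priority to low.
PRIORITY_ORDER = ["priority1", "priority2", "priority3"]
```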
For example, when the text input to the system is "I want to set a do-not-disturb", the control layer does not obtain an effective domain identification result through fast full-precision matching of the text. When the text is processed by the classifier layer, an effective domain identification result is obtained through the parallel processing of the sub-domain classifiers in priority1, namely, the domain corresponding to the text is do-not-disturb. The system then feeds the obtained domain identification result back to the mobile phone. The above example is presented in the form of:
Text corresponding to the voice input by the user: I want to set a do-not-disturb
The process is as follows: [ I want to set a do-not-disturb ] < nodisturb, priority1>
An effective domain identification result is obtained from the sub-domain classifiers in priority1 and returned directly to the mobile phone.
Field identification result: [ nodisturb ] returned from < priority1>
It should be noted that the above domain identification process involves the controller layer and the sub-domain classifiers in priority1 of the classifier layer.
For another example, when the text input to the system is "see the stock market", the control layer does not obtain an effective domain recognition result through fast full-precision matching of the text. When the text is processed by the classifier layer, the domain recognition result obtained by the parallel processing of the sub-domain classifiers in priority1 is other. The text is then processed by the sub-domain classifiers of the next priority: the parallel processing of the sub-domain classifiers in priority2 obtains an effective domain identification result, namely, the domain corresponding to the text is stock. The system then feeds the obtained domain identification result back to the mobile phone. The above example is presented in the form of:
Text corresponding to the voice input by the user: to see the stock market
The process is as follows: [ see stock market ] < other, priority1>
No effective domain identification result is obtained from the sub-domain classifiers in priority1 (the obtained result is other), so the text is handed to the sub-domain classifiers in priority2 for processing.
[ see stock market ] < stock, priority2>
An effective domain identification result is obtained from the sub-domain classifiers in priority2 and returned directly to the mobile phone.
Field identification result: [ stock ] returned from < priority2>
It should be noted that the above domain identification process involves the controller layer and the sub-domain classifiers in priority1 and priority2 of the classifier layer.
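The cascading behavior of these examples can be sketched as follows, treating "other" as "no effective result" and using toy keyword classifiers as stand-ins for trained sub-domain classifiers (all names are illustrative assumptions):

```python
def identify_domain(text, groups):
    """Try each priority group in turn; stop at the first effective result."""
    for priority, classifiers in groups:
        # In the real system, the classifiers of one group run in parallel.
        results = [classify(text) for classify in classifiers]
        effective = [r for r in results if r != "other"]
        if effective:
            return effective[0], priority
    return "other", None

# Toy keyword classifiers standing in for trained sub-domain classifiers.
nodisturb = lambda t: "nodisturb" if "do-not-disturb" in t else "other"
stock = lambda t: "stock" if "stock" in t else "other"

groups = [("priority1", [nodisturb]), ("priority2", [stock])]
print(identify_domain("see the stock market", groups))  # ('stock', 'priority2')
```

Text that priority1 cannot place falls through to priority2, exactly as in the "see the stock market" example above.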
In this embodiment, if priority2 included the sub-domain classifier corresponding to encyclopedia, the domain identification result obtained from the sub-domain classifiers in priority2 might include both stock and encyclopedia, or either one of them, and the domain identification result returned to the mobile phone would then have a greater probability of error. Therefore, in the classifier layer, the priority grouping of the sub-domain classifiers is very important. For a domain that is ambiguous or difficult to distinguish, its sub-domain classifier may be placed in a lower-priority group. In this way, once a high-priority group obtains an effective domain identification result, the text does not need to be passed to the low-priority groups for domain identification, which reduces the identification pressure on the lower priorities.
For another example, when the text input to the system is "query Wuliangye", the control layer does not obtain an effective field recognition result through fast full-precision matching of the text. When the text is processed by the classifier layer, the domain identification result obtained by the parallel processing of the sub-domain classifiers in priority1 is other. The text is handed to the sub-domain classifiers of the next priority, namely the sub-domain classifiers in priority2, for parallel processing, and the obtained domain identification result is still other. The text is then processed by the sub-domain classifiers of the next priority, and the parallel processing of the sub-domain classifiers in priority3 obtains an effective domain identification result, namely, the domain corresponding to the text is encyclopedia. The system then feeds the obtained domain identification result back to the mobile phone. The above example is presented in the form of:
Text corresponding to the voice input by the user: query Wuliangye
The process is as follows: [ query Wuliangye ] < other, priority1>
No effective domain identification result is obtained from the sub-domain classifiers in priority1 (the obtained result is other), so the text is handed to the sub-domain classifiers in priority2 for processing.
[ query Wuliangye ] < other, priority2>
No effective domain identification result is obtained from the sub-domain classifiers in priority2 (the obtained result is other), so the text is handed to the sub-domain classifiers in priority3 for processing.
[ query Wuliangye ] < Baike, priority3>
An effective domain identification result is obtained from the sub-domain classifier in priority3 and returned directly to the mobile phone.
Field identification result: [ Baike ] returned from < priority3>
It should be noted that the above domain identification process involves the controller layer and the sub-domain classifiers in priority1, priority2, and priority3 of the classifier layer. In addition, this text is rather ambiguous and has a certain probability of being identified as stock, encyclopedia, or other. In the embodiment of the application, placing the domain into which such text most easily falls in the lowest priority of the classifier layer effectively reduces conflicts among the priority groups and relieves the recognition pressure on the higher-priority sub-domain classifiers.
In the implementation process, the mobile phone can make full use of local user data and can effectively perform domain identification without data interaction with the cloud. Local user data refers to data stored locally on the mobile phone, for example, in its memory, including but not limited to the content of the various libraries involved in the system. This saves the time consumed by data interaction between the mobile phone and the cloud; moreover, during domain identification, multiple sub-domain classifiers of the same priority can complete the identification operation at the same time, which further saves the time consumed by the domain identification process.
In the embodiment of the application, the priorities of the classifier layer can be divided according to the characteristics of the different domain categories and the accuracy and performance of each sub-domain classifier's model. Common or fixed expressions that easily produce ambiguity can be handled in the text full-precision matching of the control layer, which effectively improves the processing efficiency of domain identification and saves the time the process occupies. Texts not covered by such matching enter the sub-domain classifiers of the different priorities, in descending order of priority, for the multi-domain parallel identification process, which further improves the processing efficiency and saves processing time. It should be noted that this priority division also makes effective use of sub-domain classifiers with a poorer classification effect, namely by putting them into lower-priority groups.
In the above system, the recognition capability of a sub-domain classifier affects the domain recognition result, and its training determines that recognition capability, so the training of the sub-domain classifier is very important.
Fig. 6 is a flowchart of an exemplary method, according to the embodiment of the present application, for training a sub-domain classifier on text whose domain is known. The method comprises steps S201 to S208.
S201, inputting a text.
S202, screening the text through the rule.
In an embodiment of the present application, the rule may be a pattern such as [ ^(search|check|see|tell|open).{1,12}(stock)$ ]. Here, "^" is the start symbol of the rule, meaning that "search", "check", "see", "tell", or "open" serves as the starting keyword; after an interval of 1 to 12 characters, the two-character word "stock" serves as the ending keyword; and "$" is the end symbol of the rule, meaning that the text ends with "stock".
The starting keyword means that the first word in the text is "search", "check", "see", "tell", or "open"; the ending keyword means that the last word in the text is "stock".
It should be noted that the start symbol and the end symbol are optional in the pattern and are not meant to limit the embodiments of the present application. For example, the rule may be a pattern in the form [ (search|check|see|tell|open).{1,12}(stock) ]. This pattern matches text in which "search", "check", "see", "tell", or "open" appears as a keyword and, after an interval of 1 to 12 characters, is followed by the two-character word "stock". In this case the first word in the text need not be one of the keywords, as long as one of them appears somewhere in the text; likewise, "stock" must appear 1 to 12 characters after the keyword but need not be the last word in the text.
That is, the rule may include a start symbol, an end symbol, or both a start symbol and an end symbol, but is not limited thereto.
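The two rule patterns described above can be sketched with Python's `re` module; the English keywords and the "stock" ending keyword stand in for the original-language words, so the concrete strings here are illustrative assumptions:

```python
import re

# Anchored rule from S202: the text must START with a trigger keyword and
# END with "stock", with 1 to 12 characters in between.
anchored = re.compile(r"^(search|check|see|tell|open).{1,12}stock$")

# Unanchored variant: the keyword and "stock" may appear anywhere in the
# text, still separated by 1 to 12 characters.
unanchored = re.compile(r"(search|check|see|tell|open).{1,12}stock")

print(bool(anchored.match("search for some stock")))                  # True
print(bool(anchored.match("please search for some stock")))           # False: no leading keyword
print(bool(unanchored.search("please search for some stock today")))  # True
```

Note that `re.match` already anchors at the start of the string, while the trailing `$` still forces the anchored text to end with "stock".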
S203, returning the field recognition result when the text meets the rule.
With reference to the above description, for a text that can match the rule, the domain to which the text belongs may be directly determined, and thus the domain identification result is determined and returned.
S204, when the text does not meet the rule, performing NER on the text and completing the common feature replacement.
In an implementation manner of the embodiment of the present application, the common features include, but are not limited to, words such as times and places, and may be preset. In the embodiments of the present application, common features may be replaced with symbols and the like, which is not limited herein.
In one implementation of the embodiment of the present application, NER is performed on the text to recognize words such as times and places; the recognized content is then treated as common features, each of which is replaced with a preset symbol or the like.
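A minimal sketch of this NER-plus-replacement step, assuming a toy dictionary-lookup "recognizer" in place of a real NER model (the entity lists and the "#" placeholder are illustrative assumptions):

```python
# Hypothetical dictionary-lookup "NER"; a real system would use a trained
# named-entity recognizer.
PLACES = {"Tiananmen"}
TIMES = {"tomorrow", "today"}

def replace_common_features(text):
    # Each recognized time/place word (a common feature) is replaced
    # with a preset symbol.
    for entity in PLACES | TIMES:
        text = text.replace(entity, "#")
    return text

print(replace_common_features("search for photos of Tiananmen"))
# search for photos of #
```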
S205, performing feature extraction on the replaced text.
Feature extraction refers to extracting features from the replaced text according to the binary method, the ternary method, or the like. For example, extracting according to the binary method yields multiple groups of two-character features, where a feature may consist of two characters, one character and one symbol, or two symbols, and so on.
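The binary (bi-gram) split described above can be sketched as follows; note how a placeholder symbol left by common-feature replacement simply becomes part of a feature:

```python
def bigrams(text):
    # Overlapping two-character features of the replaced text; a feature may
    # pair a word character with a placeholder symbol such as "#".
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(bigrams("stock#"))  # ['st', 'to', 'oc', 'ck', 'k#']
```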
S206, calculating the weight of each feature.
It should be noted that the manner of calculating the weight of each feature may follow prior-art implementations of the binary method, the ternary method, and the like.
For example, taking the binary method, the numerical values corresponding to the features obtained by the binary split are input into the model, and the model is computed with an algorithm such as Linear Regression (LR) to output one weight per feature, that is, each feature corresponds to one weight. For the parameters input to the model, different features correspond to different values, which may be preset; the specific setting is not limited herein. For the manner of model calculation, reference may be made to algorithms provided in the prior art, such as the LR algorithm described above, which is not described again herein.
S207, calculating the value corresponding to the text according to the weights.
The value of the replaced text, that is, of the input text, is calculated according to the feature weights, for example, by summing the weights of all features in the text, or by first processing each feature's weight and then summing, which is not limited herein.
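The summation variant of this step can be sketched as follows; the feature names and weight values are invented for illustration, standing in for the weights the trained model would output in S206:

```python
# Hypothetical per-feature weights, standing in for the model output of S206.
WEIGHTS = {"st": 0.31, "to": 0.05, "oc": -0.02, "ck": 0.12, "k#": 0.0}

def text_value(features, weights):
    # Summation variant of S207: the value of the text is the sum of the
    # weights of its features; unknown features contribute 0.
    return sum(weights.get(f, 0.0) for f in features)

print(round(text_value(["st", "to", "oc", "ck", "k#"], WEIGHTS), 2))  # 0.46
```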
S208, adjusting the sub-domain classifier according to the calculated value and the known domain identification result.
Since S201 to S208 train the sub-domain classifier on text whose domain is known, the sub-domain classifier can be adjusted by comparing the recognition result it produces with the domain to which the text actually belongs. The manner of adjusting the sub-domain classifier includes, but is not limited to, adjusting its positive and negative samples. It should be noted that adjusting the positive and negative samples affects the feature weights, and ultimately the calculated value of the text, thereby affecting the field recognition result.
After the adjustment is completed, the mobile phone can process the same text with the same sub-domain classifier again, until a correct domain identification result is obtained. In other words, during the training of the sub-domain classifier, the steps shown in S201 to S208 are repeated until the training goal is achieved.
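The repeat-until-correct loop of S201 to S208 can be sketched with a toy classifier; the class, its weights, and the adjustment rule are all illustrative assumptions, not the patent's actual model:

```python
class ToyClassifier:
    """Illustrative stand-in for a sub-domain classifier."""

    def __init__(self):
        self.weights = {"stock": 2.0, "trade": 1.0}  # hypothetical feature weights
        self.threshold = 1.5

    def classify(self, text):
        value = sum(w for feature, w in self.weights.items() if feature in text)
        return "stock" if value >= self.threshold else "other"

    def adjust_samples(self, text, true_domain):
        # Crude stand-in for S208: adding the misrecognized text to the
        # negative samples lowers the weights of its features.
        if true_domain != "stock":
            for feature in self.weights:
                if feature in text:
                    self.weights[feature] -= 1.0

def train_until_correct(classifier, text, true_domain, max_rounds=10):
    # Repeat S201-S208: classify, compare with the known domain, adjust.
    for _ in range(max_rounds):
        if classifier.classify(text) == true_domain:
            return True
        classifier.adjust_samples(text, true_domain)
    return False

c = ToyClassifier()
print(train_until_correct(c, "trade stock tips", "other"))  # True
```

Here one adjustment round lowers the offending feature weights below the threshold, after which the same text is recognized correctly, mirroring the loop described above.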
Fig. 7 is a flowchart of an exemplary training method for adjusting positive and negative samples of a sub-domain classifier according to an embodiment of the present disclosure. The method comprises steps S301 to S310.
S301, generating positive and negative samples of the sub-domain classifier.
In the embodiment of the present application, each sub-domain classifier may have its own independent positive and negative samples, where the positive and negative samples include a positive training sample set and a negative training sample set. The samples in the positive example training sample set are samples belonging to the corresponding field of the sub-field classifier, and the samples in the negative example training sample set are samples not belonging to the corresponding field of the sub-field classifier.
S302, performing NER and rule extraction on the positive and negative samples.
For example, if the text content of a sample is "search for photos of Tiananmen", then after NER, "Tiananmen" is recognized, and rule extraction yields a pattern of the form [ ^(search).{1,10}(photos)$ ]. Thus, the "Tiananmen" obtained through NER serves as a common feature, and [ ^(search).{1,10}(photos)$ ] serves as the rule.
S303, completing the common feature replacement.
In one implementation of the embodiment of the present application, it may be predefined that a place name such as Tiananmen is replaced with "#", so the text content after common feature replacement is "search for photos of #".
For S302 and S303, reference may be made to the description of S202 to S205, which is not repeated herein.
It should be noted that performing NER on the positive and negative samples may be a prerequisite for rule extraction and common feature replacement. That is, the places, times, sentence patterns, and the like in the positive and negative samples are identified through NER; the sentence pattern is then used as a rule, the times, places, and the like are used as common features, and the replacement of common features with symbols is completed.
S304, removing noise such as stop words.
In the embodiment of the present application, stop words refer to words, terms, or symbols that play no decisive role in field recognition but whose presence may affect the accuracy of the field recognition result, for example, punctuation marks such as ";" and ",". These stop words are recognized and ignored in the domain recognition process.
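This filtering step can be sketched as follows; the stop-word list is an illustrative assumption (real lists are language- and domain-specific):

```python
# Hypothetical stop-word list: symbols and words with no decisive role
# in field recognition.
STOP_WORDS = {";", ",", "the", "a"}

def remove_stop_words(tokens):
    # Recognize and ignore stop words before feature extraction.
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words(["search", "the", "stock", ";"]))  # ['search', 'stock']
```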
S305, extracting the features to generate a training corpus feature library.
The corpus feature library is used to record the correspondence between the extracted features and the weights calculated as in S206.
S306, calculating the value corresponding to the text according to the weights.
S307, training the sub-domain classifier.
For a specific training process, reference may be made to the implementation processes of S201 to S207, which are not described herein again.
S308, evaluating the influence of erroneous field identification results.
S309, modifying the positive and negative samples.
S308 and S309 serve a purpose similar to that of S208; in the embodiment of the present application, modifying the positive and negative samples may serve as an exemplary implementation of S208.
The training process of the sub-domain classifier by the system is explained below with reference to an exemplary example.
In an exemplary implementation manner, taking a sub-domain classifier corresponding to a stock domain as an example, a training sample and a domain identification result obtained after the sample is processed by a system include the following contents:
the first round of processing results of the system on the training sample 1 and the training sample 2 are as follows:
Training sample 1
Text corresponding to the voice command input by the user: stock quotes of Tonghuashun
Field identification result: [ stock ] returned from < priority2>
Training sample 2
Text corresponding to the voice command input by the user: trade stocks with Tonghuashun
Field identification result: [ stock ] returned from < priority2>
In the embodiment of the present application, "Tonghuashun" (rendered literally by the translation as "same flower order") is the name of both a listed company and an application, where the application is used for stock trading. In training sample 1, the user wants to query the stock of Tonghuashun; in training sample 2, the user wants to open the application named Tonghuashun to trade stocks. Therefore, the field recognition result obtained for training sample 1 is accurate, while the field recognition result obtained for training sample 2 is erroneous.
In the following, taking the binary method as an example, the text converted from the voice instruction input by the user is split into multiple groups of two-character features. The feature names below are English glosses of the original-language character bigrams.
In training sample 1, the features and their corresponding weights are as follows:
"same flower": 0.33474357
"flower order": 0.23474357
"order stock": 0.30918131
"stock quotes": 1.57149447
Value of text: 0.33474357 + 0.23474357 + 0.30918131 + 1.57149447 = 2.45016292
In training sample 2, the features and their corresponding weights are as follows:
"same flower": 0.33474357
"flower order": 0.23474357
"order trade": -0.34392488
"trade stock": -0.34392488
"stock": 1.99415611
Value of text: 0.33474357 + 0.23474357 - 0.34392488 - 0.34392488 + 1.99415611 = 1.87579349
It should be noted that the weight of a feature may be positive, negative, or 0. In the embodiment of the present application, the larger the weight value of a feature, the larger its contribution to recognizing the text as belonging to the corresponding sub-domain (the "stock" domain in this example).
In one implementation of the embodiment of the present application, 1.5 is used as the threshold value of a text: when the sum of the weights of all features in the text is greater than or equal to 1.5, the text is confirmed to belong to the stock domain; when the sum is less than 1.5, the text is confirmed not to belong to the stock domain. Since every feature weight involved in training sample 1 is positive, a correct domain identification result, namely the stock domain, is calculated from the weights. In training sample 2, due to the presence of the noise "trade" and the fact that the absolute values of the negative weights of the features "order trade" and "trade stock" are too small, the sum of the feature weights is a positive number greater than the threshold, and the text is therefore still misrecognized as belonging to the stock domain.
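As a check on the first-round numbers for training sample 2, a minimal sketch of the threshold decision (assuming, consistently with the small-magnitude negative weights discussed above, that both negative bi-gram weights are -0.34392488):

```python
THRESHOLD = 1.5  # the text belongs to the stock domain iff its value >= 1.5

# First-round feature weights of training sample 2, glossed in English:
# "same flower", "flower order", "order trade", "trade stock", "stock".
weights_sample2 = [0.33474357, 0.23474357, -0.34392488, -0.34392488, 1.99415611]

value = sum(weights_sample2)
print(round(value, 8))     # 1.87579349
print(value >= THRESHOLD)  # True: still misrecognized as the stock domain
```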
In order to correct the misrecognition in the first round of processing, before the second round, the system adjusts the positive and negative samples according to the results that training sample 1 was correctly recognized and training sample 2 was incorrectly recognized: it deletes the content containing [Tonghuashun] from the positive samples, and adds [Tonghuashun] and [trade stocks] to the negative samples.
It should be noted that, normally, the weights of the features involved in content added to the positive samples increase, and the weights of the features involved in content deleted from the positive samples decrease; similarly, the weights of the features involved in content added to the negative samples decrease, and the weights of the features involved in content deleted from the negative samples increase.
For example, deleting the content containing [Tonghuashun] from the positive samples reduces the weights of the features "same flower" and "flower order", and adding [Tonghuashun] to the negative samples further reduces the weights of these features.
However, in one implementation of the embodiment of the present application, adding [trade stocks] to the negative samples does not affect the weights of the features "trade stock" and "stock". The reason may be that the number of samples containing [stock] in both the positive and negative sample sets is already large, so that adding one more such sample to the negative set has a negligible effect. For example, if there are twenty thousand samples containing [stock] in the positive set and ten thousand in the negative set, adding one negative sample containing [stock] hardly changes these huge sample sets, so the influence on the weights of the features "trade stock" and "stock" is almost zero and their values do not change. Thus, in an implementation of the embodiment of the present application, the adjustment described above, deleting the content containing [Tonghuashun] from the positive samples and adding [Tonghuashun] and [trade stocks] to the negative samples, reduces the weights of the features "same flower" and "flower order" without affecting the weights of the features "trade stock" and "stock".
The above case is an exemplary implementation manner, and is not intended to limit the embodiments of the present application.
After the first positive and negative sample adjustment, in training sample 1, the features and their corresponding weights are as follows:
"same flower": -0.34743574
"flower order": -0.34743574
"order stock": 0.30918131
"stock quotes": 1.57149447
Value of text: -0.34743574 - 0.34743574 + 0.30918131 + 1.57149447 = 1.1858043
After the first positive and negative sample adjustment, in training sample 2, the features and their corresponding weights are as follows:
"same flower": -0.34743574
"flower order": -0.34743574
"order trade": -0.34392488
"trade stock": -0.34392488
"stock": 1.99415611
Value of text: -0.34743574 - 0.34743574 - 0.34392488 - 0.34392488 + 1.99415611 = 0.61143487
After the first positive and negative sample adjustment, the weights of some or all of the features change because the positive and negative samples have changed, which in turn affects the processing result. That is, the value of the text in both training sample 1 and training sample 2 is now less than 1.5, meaning that both training samples are identified as not belonging to the stock domain. It should be noted that, for the same feature, when the feature appears in positive samples, its weight is larger; when it appears in negative samples, its weight is smaller; when it appears in both, its weight is weighted according to the numbers of positive and negative samples containing it.
After the first adjustment of the positive and negative samples, the system performs the following second round of processing on training sample 1 and training sample 2:
Training sample 1
Text corresponding to the voice command input by the user: same flower with same stock market
Field identification result: [ other ], returned from <priority3>
Training sample 2
Text corresponding to the voice command input by the user: same flower order stir-fried stock
Field identification result: [ other ], returned from <priority3>
Here, the field recognition result obtained for training sample 1 is wrong, while that obtained for training sample 2 is correct. It should be noted that when a training sample is input, the correct domain identification result corresponding to that sample can be input as well, so that the mobile phone can automatically adjust the positive and negative samples by comparing the output domain identification result with the known correct domain; alternatively, after the field identification result is output, whether it is correct can be judged manually, and the mobile phone is triggered to automatically adjust the positive and negative samples when the result is wrong.
Therefore, the system automatically adjusts the positive and negative samples again, i.e., performs a second adjustment of the positive and negative samples. On the basis of the first adjustment, the system readjusts [ co-cisterns ] in the positive samples, for example by adding content containing [ co-cisterns ] to the positive samples. In this way, the values of the weights corresponding to the "same flower" and "same flower sequence" features can be effectively increased.
After the second adjustment of the positive and negative samples, the features in training sample 1 and their corresponding weights are as follows:
flower homology: -0.03474357
straightening the flowers: -0.03474357
stranding: 0.30918131
stock market: 1.57149447
Value of the text: -0.03474357 - 0.03474357 + 0.30918131 + 1.57149447 = 1.81118864
After the second adjustment of the positive and negative samples, the features in training sample 2 and their corresponding weights are as follows:
flower homology: -0.03474357
straightening the flowers: -0.03474357
stir-frying: -0.34392488
frying the thighs: -1.34392488
stock: 1.99415611
Value of the text: -0.03474357 - 0.03474357 - 0.34392488 - 1.34392488 + 1.99415611 = 0.23681921
After the second adjustment of the positive and negative samples, the third round of processing of training sample 1 and training sample 2 by the system is as follows:
Training sample 1
Text corresponding to the voice input by the user: same flower with same stock market
Field identification result: [ stock ], returned from <priority2>
Training sample 2
Text corresponding to the voice input by the user: same flower order stir-fried stock
Field identification result: [ other ], returned from <priority3>
In the embodiment of the application, the system adjusts the positive and negative samples according to whether the domain recognition result of each round is correct, until both training sample 1 and training sample 2 obtain the correct domain recognition result. Thus, the larger the number of training samples, the higher the accuracy of the adjusted positive and negative sample sets.
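The round-by-round adjustment described above can be sketched as the following loop. The classify/adjust/retrain callbacks are hypothetical interfaces standing in for the system's actual rebalancing of the positive/negative sample sets and its weight retraining.

```python
# Sketch of the iterative sample adjustment: misrecognised samples
# trigger a rebalancing of the positive and negative sample sets, the
# weights are retrained, and the process repeats until every training
# sample yields its correct domain.

def train_until_correct(samples, classify, adjust, retrain, max_rounds=10):
    """samples: list of (text, correct_domain) pairs."""
    for _ in range(max_rounds):
        wrong = [(t, d) for t, d in samples if classify(t) != d]
        if not wrong:
            return True              # every sample recognised correctly
        for text, expected in wrong:
            adjust(text, expected)   # rebalance the pos/neg sample sets
        retrain()                    # recompute the feature weights
    return False

# Toy illustration: each adjustment nudges a single weight upward until
# the sample crosses the classifier's threshold.
state = {"w": 0.5}
classify = lambda t: "stock" if state["w"] > 1.0 else "other"
adjust = lambda t, d: state.update(w=state["w"] + 0.3)
converged = train_until_correct([("x", "stock")], classify, adjust,
                                retrain=lambda: None)
```

In the toy run, two adjustments raise the weight past the threshold and the loop terminates with a correct recognition result, mirroring the two sample adjustments in the example above.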
In order to reduce the interference of stop words, numbers, place names and similar content with the field identification process, in the embodiment of the application the field identification result can be determined by recognizing a sentence pattern, or the field identification process can be simplified by replacing interference terms. This improves the accuracy of the field identification process and further reduces the time it occupies.
In an exemplary implementation, for sentence patterns that are difficult for the sub-domain classifiers to recognize, or sentence patterns that tend to have a large influence on the domain recognition result, rules may be set in advance based on these sentence patterns for the full-precision text matching process of the control layer.
For example, the text contents obtained by speech recognition and the domain recognition results obtained by the system in examples 1 to 3 are as follows:
Example 1:
Text corresponding to the voice command input by the user: query the next strand of big carbon
Field identification result: stock
Example 2:
Text corresponding to the voice command input by the user: query stock 600160 of south-China-Hongkong and Jiangtong copper CWB1
Field identification result: stock
Example 3:
Text corresponding to the voice command input by the user: search pictures shot in Beijing yesterday
Field identification result: gallery
For the text shown in example 1, a sentence pattern such as "query the … strand" may be preset, so that even when the user makes an input error or speech recognition produces an omission, as long as the text includes the sentence pattern, the system can accurately recognize it and determine the field of the text according to the sentence pattern, thereby obtaining an accurate field recognition result.
For example, the rule may be preset as [ ^ (search | view | tell | open) {1,12} (strand) $ ], so that the system can quickly recognize and match the text and feed the obtained domain recognition result back to the mobile phone. For the meaning of the rule [ ^ (search | view | tell | open) {1,12} (strand) $ ], reference may be made to the description above, which is not repeated here.
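The rule string above is garbled by translation, so the regex below is only an assumed reconstruction of its general shape: an optional leading verb (search/view/tell/open), a bounded span of 1 to 12 arbitrary characters, and a trailing "stock"/"strand" token, matched against the whole text.

```python
import re

# Assumed reconstruction of the control-layer full-match rule; the field
# name and rule shape are illustrative, not the patent's exact rule.
STOCK_RULE = re.compile(r"^(search|view|tell|open)?.{1,12}(stock|strand)$")

def control_layer_match(text):
    """Return the field when the full-match rule fires, else None."""
    return "stock" if STOCK_RULE.match(text) else None
```

A full-match (anchored) rule of this kind is what lets the control layer return a field directly, so that only texts it cannot match fall through to the classifier layer, as described below.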
In an exemplary implementation, a classifier layer is still required to process text that the above rules cannot match.
Taking example 3 as an example, in the embodiment of the present application, "search for … pictures" may be taken as a pattern. Thus, for the gallery field, the rules of the sub-field classifier corresponding to the gallery field may include this pattern; that is, during text recognition, when the sub-field classifier recognizes the pattern, it can feed back an effective field recognition result, namely that the field of the text is the gallery.
Taking example 2 as an example, common features can be set for the system in advance to prevent inaccurate field identification caused by such features. When the system processes the text, runs of 6 consecutive digits in the text can be replaced; for example, [600160] is replaced by @, so that the content of the text becomes "query the stock @ of beijing hong and jiang bronze CWB1". The system can then call NER to extract the NE information in the text as common features; for example, [ Nanjing harbor ] is defined as a "common company name" entity and replaced with #, and [ Jiangtong CWB1 ] is defined as a "listed company name code" entity and replaced with @. The content of the text then becomes "query the stock of # and @".
Similarly, times in the text can be replaced by $ and places by #, so that the content of the text in example 3 becomes "search for pictures shot in # $".
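A minimal sketch of the replacement step just described. The NER step is stubbed out by caller-supplied entity lists; a real implementation would obtain the time and place entities from a named-entity recognizer.

```python
import re

# Interference-term replacement: 6-digit runs (e.g. stock codes) become
# "@", times become "$", places become "#", as described above.
def replace_common_features(text, places=(), times=()):
    text = re.sub(r"\d{6}", "@", text)  # 6 consecutive digits -> @
    for t in times:
        text = text.replace(t, "$")     # times are replaced by $
    for p in places:
        text = text.replace(p, "#")     # places are replaced by #
    return text
```

For the example 3 text, supplying "Beijing" as a place entity and "yesterday" as a time entity yields "search pictures shot in # $".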
After the above replacement process is completed, the features in example 2 and their corresponding weights are as follows:
querying: 0.1067646020633481
inquiring: -0.10021895439172483
the following: -0.215034710246020433
the following #: 0.1067646020633481
# and: null (null indicates that the feature "# and" has no influence on the field recognition result, i.e. the weight of the feature is 0)
and @: 0.12009207293891772
@ thigh: 0.304457783445201952
stock: 1.1114948005328673
ticket @: 0.3067646020633481
After the above replacement process is completed, the features in example 3 and their corresponding weights are as follows:
searching for one: 0.3835541240544907
The following: -0.2517062504931636
The following $: 0.14542119078470123
The method comprises the following steps: 0.094333521958256
In #: 0.19608161704432386
Pat # time: -0.006875871484002316
Beating: 0.5827998208565368
The following drawings: 0.26154773801450293
Picture: 0.17497209951796067
+:1.4622835953886275
The "+" feature indicates that the replaced text matches a pattern defined in the sub-domain classifier; therefore, when the value corresponding to the text is calculated, an additional weight is added for the matched pattern, which improves the accuracy of domain identification.
The value corresponding to the replaced text is as follows: 3.04241158392
It can be seen that the above replacement process replaces common features such as times and places, and since a common feature often contains at least two words, the number of feature-weight pairs obtained after replacement is reduced. Especially when many common features are involved in the text, this replacement can effectively simplify the calculation performed by the sub-field classifier and thus improve its efficiency. Moreover, the replacement effectively reduces the interference of common features with field identification.
The voice playback content or display results obtained from voice commands in several different fields are exemplified below.
Text corresponding to the voice command input by the user: enlarge the characters a bit
Response: OK, adjusted for you
The mobile phone performs field recognition on the text obtained after voice recognition, whose content is "enlarge the characters a bit", and the obtained field recognition result is that the text belongs to the settings field. The mobile phone then sends the text to the dialogue engine corresponding to the settings field for processing. It should be noted that by the time the mobile phone gives the user the response, it has already enlarged the font according to the user's request.
Text corresponding to the voice command input by the user: please help me set do-not-disturb from two to three o'clock this afternoon, except for Old Wang
Response: Do-not-disturb is enabled from 14:00 to 15:00, except for Old Wang
The mobile phone performs field recognition on the text obtained after voice recognition, whose content is "please help me set do-not-disturb from two to three o'clock this afternoon, except for Old Wang", and the obtained field recognition result is that the text belongs to the do-not-disturb field. The mobile phone then sends the text to the dialogue engine corresponding to the do-not-disturb field for processing. It should be noted that by the time the mobile phone gives the user the response, it has already set the do-not-disturb period according to the user's requirement, while ensuring that the user is still prompted for incoming calls from Old Wang during the do-not-disturb period.
Text corresponding to the voice command input by the user: please help me Baidu pictures of ice
Response: jumping to Baidu to display pictures related to ice
The mobile phone performs field recognition on the text obtained after voice recognition, whose content is "please help me Baidu pictures of ice", and the obtained field recognition result is that the text belongs to the gallery field. The mobile phone then sends the text to the dialogue engine corresponding to the gallery field for processing. It should be noted that by the time the mobile phone gives the user the response, it has already completed the picture search according to the user's request; that is, the relevant photos found via Baidu have been presented to the user.
Text corresponding to the voice command input by the user: how to say chopsticks in English
Response: chopsticks
The mobile phone performs field recognition on the text obtained after voice recognition, whose content is "how to say chopsticks in English", and the obtained field recognition result is that the text belongs to the translation field. The mobile phone then sends the text to the dialogue engine corresponding to the translation field for processing. It should be noted that by the time the mobile phone gives the user the response, it has already completed the translation of the word "chopsticks" according to the user's requirement.
Text corresponding to the voice command input by the user: how is the weather today
Response: Light rain in Shanghai today, 7-10°C, air quality index 88, the air is acceptable; please remember to bring an umbrella
The mobile phone performs field recognition on the text obtained after voice recognition, whose content is "how is the weather today", and the obtained field recognition result is that the text belongs to the weather field. The mobile phone then sends the text to the dialogue engine corresponding to the weather field for processing. It should be noted that by the time the mobile phone gives the user the answer, it has already determined the weather conditions for the user's current geographic location.
Text corresponding to the voice command input by the user: Vanke's stock
Response: Vanke A, 39.42 yuan today, down 0.86%, the market has closed
The mobile phone performs field recognition on the text obtained after voice recognition, whose content is "Vanke's stock", and the obtained field recognition result is that the text belongs to the stock field. The mobile phone then sends the text to the dialogue engine corresponding to the stock field for processing. It should be noted that by the time the mobile phone gives the user the response, it has already determined the stock information that the user wants to know.
Text corresponding to the voice command input by the user: 2 to the power of 13 equals how much
Response: 2 to the power of 13 equals 8192
The mobile phone performs field recognition on the text obtained after voice recognition, whose content is "2 to the power of 13 equals how much", and the obtained field recognition result is that the text belongs to the calculation field. The mobile phone then sends the text to the dialogue engine corresponding to the calculation field for processing. It should be noted that by the time the mobile phone gives the user the response, it has already determined, by calculation, the result the user wants to obtain.
Text corresponding to the voice command input by the user: the height of Yao Ming
Response: Yao Ming, born on 12 September 1980 in Xuhui District, Shanghai, with his ancestral home in Suzhou, Jiangsu, is a former Chinese professional basketball player who played center, and is the incumbent chairman and general manager of the Chinese professional basketball league company. In April 1998, Yao Ming began his basketball career. He won the CBA regular-season MVP in 2001, and the CBA championship and finals MVP in 2002; he was selected as the CBA rebounding king and blocking king three times each, and as the CBA dunking king twice.
The mobile phone performs field recognition on the text obtained after voice recognition, whose content is "the height of Yao Ming", and the obtained field recognition result is that the text belongs to the encyclopedia field. The mobile phone can then search the extracted keywords in the encyclopedia and present the search results to the user; at the same time, it can selectively present related content found in the search. It should be noted that by the time the mobile phone gives the user the response, it has already searched for Yao Ming's height and other related information.
In the above examples, the response may be given as, but is not limited to, a text prompt or a voice prompt; this is not limited here.
In the embodiment of the present application, the electronic device may be divided into functional modules according to the method embodiment; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. It should be noted that, in the embodiment of the present application, the division into modules is schematic and is merely a logical division of functions; other divisions are possible in actual implementation.
Fig. 8 is a schematic diagram illustrating an exemplary structure of the apparatus for performing speech recognition for the electronic device according to the above embodiment. The apparatus 400 for performing speech recognition by an electronic device includes: the system comprises a receiving module 401, a conversion module 402, a first domain identification module 403, a processing module 404, a second domain identification module 405, a control module 406 and a sub-domain classifier 407. The sub-domain classifier 407 includes a named entity identification module 4071, a replacement module 4072, an extraction module 4073, a calculation module 4074, and a domain determination module 4075. It should be noted that the electronic device 400 includes at least one sub-domain classifier 407, which is not limited herein.
The receiving module 401 is configured to support the electronic device 400 in receiving a voice instruction; for example, the user inputs the voice command corresponding to the text through the electronic device, i.e., the voice input shown in fig. 4. The conversion module 402 is used to support the electronic device 400 in converting the voice command into text; for example, as shown in fig. 4, the input voice is converted into text by means of voice recognition. The first domain identification module 403 is configured to support the electronic device 400 in identifying the text through at least two sub-domain classifiers to obtain a domain identification result; for example, the sub-domain classifiers in each priority (i.e., each sub-domain classifier group) of the classifier layer shown in fig. 4 identify the text, for instance the sub-domain classifier 11, the sub-domain classifier 12 and the sub-domain classifier 13 in priority1 identify the text in parallel. The processing module 404 is configured to enable the electronic device 400 to process the text through the dialog engine corresponding to the field to which the text belongs, to determine the function that the electronic device needs to perform in response to the text, and to enable the electronic device 400 to implement other processes of the technology described herein. The second domain recognition module 405 is configured to support the electronic device 400 in matching the text with the pre-stored text; for example, as shown in fig. 4, the text is quickly and accurately matched in the control layer. When the text is successfully matched with the pre-stored text, the field corresponding to the pre-stored text is determined to be the field recognition result of the text; when the matching fails, the text is input into the classifier layer, and the first domain identification module 403 performs domain identification on the text through at least two sub-domain classifiers to obtain the domain identification result.
In one implementation of the embodiment of the present application, the first domain identification module includes N sub-domain classifier groups, where each group has a different priority and N is a positive integer greater than or equal to 2. At least one of the N sub-domain classifier groups includes at least two sub-domain classifiers, and each sub-domain classifier is used to confirm whether the text belongs to the domain corresponding to that sub-domain classifier. The control module 406 is configured to support the electronic device 400 in controlling the sub-domain classifiers in the highest priority group of the N sub-domain classifier groups to perform domain recognition on the text; for example, as shown in fig. 4, the sub-domain classifiers in the highest priority group of the classifier layer, i.e., priority1, are controlled to perform domain recognition on the text. If a sub-domain classifier in the highest priority group identifies the domain to which the text belongs, that domain is taken as the domain identification result. If no sub-domain classifier in the highest priority group recognizes the domain to which the text belongs, the sub-domain classifiers in the next priority group of the N sub-domain classifier groups perform domain recognition on the text; for example, as shown in fig. 4, when no domain recognition result is obtained after the text passes through the sub-domain classifiers in priority1, the sub-domain classifiers in priority2 perform domain recognition on the text. This continues until either the domain to which the text belongs is identified, in which case the identified domain is taken as the domain identification result, or the text has passed through all the sub-domain classifiers in the N sub-domain classifier groups. For example, as shown in fig. 4, when no domain recognition result is obtained after the text has passed through all the sub-domain classifiers in priority1, priority2 and priority3 of the classifier layer, the processing of this voice command ends.
The control module 406 is further configured to determine at least one of the first domain recognition result and the second domain recognition result as the domain recognition result or determine that the first domain recognition result and the second domain recognition result are both the domain recognition results when the first sub-domain classifier performs the domain recognition on the text to obtain the first domain recognition result and the second sub-domain classifier performs the domain recognition on the text to obtain the second domain recognition result. Taking priority1 shown in fig. 4 as an example, when the sub-domain classifier 11 obtains the first domain identification result and the sub-domain classifier 12 obtains the second domain identification result, the control module 406 performs the above process. It should be noted that, at this time, if the sub-domain classifier 13 obtains the third domain identification result, the control module 406 determines that at least one of the first domain identification result, the second domain identification result, and the third domain identification result is the domain identification result of the text.
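The cascade performed by the control module — try the highest priority group first, collect every result produced within a group, and fall through to the next group only when no sub-domain classifier recognizes the text — can be sketched as follows; the classifier callables below are hypothetical stand-ins for the sub-domain classifiers.

```python
# Sketch of the priority cascade: groups are tried in priority order;
# within a group every sub-domain classifier examines the text, and the
# first group yielding one or more domains ends the search.

def recognise_domain(text, classifier_groups):
    """classifier_groups: list of lists of callables, highest priority
    first; each callable returns a domain name or None."""
    for group in classifier_groups:
        results = [c(text) for c in group]   # may run in parallel
        hits = [r for r in results if r is not None]
        if hits:
            return hits   # one or several domain recognition results
    return []             # no classifier recognised the text's domain

# Toy classifiers for illustration (hypothetical):
def stock_classifier(text):
    return "stock" if "stock" in text else None

def weather_classifier(text):
    return "weather" if "weather" in text else None

groups = [[weather_classifier], [stock_classifier]]  # priority1, priority2
```

Returning the list of hits, rather than a single value, matches the behaviour described above in which the control module may take one, several, or all of the results produced within a priority group as the domain recognition result.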
In the sub-domain classifier 407, the named entity recognition module 4071 is used to support the electronic device 400 in performing NER on the text and determining the common features in the recognized content. The replacing module 4072 is configured to enable the electronic device 400 to replace the common features in the text according to a preset rule. The extraction module 4073 is used to support the electronic device 400 in extracting features from the replaced text and determining the weight of each feature. The calculation module 4074 is configured to enable the electronic device 400 to calculate the value of the text according to the weight of each feature. The domain determining module 4075 is configured to enable the electronic device 400 to determine that the text belongs to the domain corresponding to the sub-domain classifier when the value of the text is greater than the threshold. The sub-domain classifier 407 may be any one of the sub-domain classifiers in the classifier layer shown in fig. 4.
In one implementation of the embodiment of the present application, the electronic device 400 may further include at least one of a storage module 408, a communication module 409, and a display module 410. Wherein the storage module 408 is used to support the electronic device 400 to store program codes and data of the electronic device; the communication module 409 may support data interaction between various modules in the electronic device 400 and/or support communication between the electronic device 400 and other electronic devices such as servers, other electronic devices, and the like; the display module 410 may support the electronic device 400 to present the processing result of the voice command to the user in a text, a graphic, or the like, or selectively present the voice recognition process to the user in the voice recognition process, which is not limited herein.
The receiving module 401 and the communication module 409 may be implemented as transceivers; the conversion module 402, the first domain identification module 403, the processing module 404, the second domain identification module 405, the control module 406, and the sub-domain classifier 407 may be implemented as a processor; the storage module 408 may be implemented as a memory; the display module 410 may be implemented as a display.
In an implementation of the embodiment of the present application, the processor may also be a controller; for example, the processor may be a CPU, a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules and circuits described in connection with this disclosure. The processor may also be a combination implementing computing functions, e.g., a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The above-mentioned transceiver may also be implemented as a transceiver circuit, a communication interface, or the like.
As shown in fig. 9, the electronic device 50 may include: a processor 51, a transceiver 52, a memory 53, a display 54, and a bus 55. The transceiver 52, the memory 53 and the display 54 are optional components, that is, the electronic device 50 may include one or more of the optional components. The processor 51, the transceiver 52, the memory 53, and the display 54 are connected to each other via a bus 55; the bus 55 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may be composed of corresponding software modules, which may be stored in random access memory (RAM), flash memory, read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in the same apparatus or may be separate components in different apparatuses.
The embodiment of the application provides a readable storage medium which comprises instructions. When the instructions are run on an electronic device, the instructions cause the electronic device to perform the method described above.
The present application provides a computer program product comprising software code for performing the above-mentioned method.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the embodiments of the present application in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present application and are not intended to limit the scope of the present application, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the embodiments of the present application should be included in the scope of the embodiments of the present application.

Claims (19)

  1. A method for speech recognition by an electronic device, the method comprising:
    converting the received voice instruction into a text;
    performing field recognition on the text through at least two sub-field classifiers to obtain a field recognition result, wherein the field recognition result is used for representing the field to which the text belongs;
    and processing the text through a dialog engine corresponding to the field to which the text belongs, and determining the function which needs to be executed by the electronic equipment and corresponds to the text.
  2. The method of claim 1, wherein after said converting the received voice instruction to text, the method further comprises:
    matching the text with a pre-stored text;
    and when the matching of the text and the pre-stored text is successful, determining that the field corresponding to the pre-stored text is the field recognition result of the text.
  3. The method according to claim 2, wherein the text is subjected to domain recognition by at least two sub-domain classifiers to obtain a domain recognition result, specifically:
    and when the matching of the text and the pre-stored text fails, performing field recognition on the text through at least two sub-field classifiers to obtain a field recognition result.
  4. The method according to any one of claims 1 to 3, wherein the electronic device comprises N sub-domain classifier groups, wherein each group has a different priority, N being a positive integer greater than or equal to 2;
    the text is subjected to field recognition through at least two sub-field classifiers to obtain a field recognition result, specifically:
    performing domain recognition on the text through a sub-domain classifier in a highest priority group of the N sub-domain classifier groups;
    if the sub-domain classifier in the highest priority group identifies the domain to which the text belongs, the sub-domain classifier in the highest priority group identifies the domain to which the text belongs as the domain identification result;
    if the sub-domain classifier in the highest priority group does not recognize the domain to which the text belongs, performing domain recognition on the text by the sub-domain classifier in the next priority group in the N sub-domain classifier groups until:
    identifying a domain to which the text belongs, and taking the identified domain as the domain identification result; or
    The text is subjected to domain recognition through all the sub-domain classifiers in the N sub-domain classifier groups;
    at least one of the N sub-domain classifier groups includes at least two sub-domain classifiers.
  5. The method of claim 4, wherein at least two of the sub-domain classifiers in at least one of the N sub-domain classifier groups perform domain recognition on the text in parallel.
  6. The method according to claim 4 or 5, wherein, among the N sub-domain classifier groups, the domain recognition accuracy of the sub-domain classifiers in a lower-priority group is lower than that of the sub-domain classifiers in a higher-priority group.
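The priority cascade of claims 4 to 6 can be sketched as follows: groups are tried in descending priority order, every classifier within a group votes (claim 5 permits doing this in parallel), and the first group that recognizes a domain supplies the result. The keyword classifiers and the group contents below are hypothetical illustrations of the structure, not the patent's classifiers.

```python
def make_keyword_classifier(domain, keywords):
    """Toy sub-domain classifier: claims the text iff it contains a keyword."""
    def classify(text):
        return domain if any(k in text for k in keywords) else None
    return classify

# Per claim 6, the more accurate classifiers sit in the higher-priority group.
groups = [
    [make_keyword_classifier("telephone", ["call", "dial"]),
     make_keyword_classifier("music", ["play", "song"])],        # priority 1
    [make_keyword_classifier("chitchat", ["hello", "weather"])], # priority 2
]

def cascade(text, groups):
    for group in groups:                        # highest-priority group first
        results = [clf(text) for clf in group]  # could run concurrently (claim 5)
        recognized = [r for r in results if r is not None]
        if recognized:
            return recognized[0]   # claim 7 also allows keeping several results
    return None  # all N groups tried without recognizing a domain

print(cascade("call my mother", groups))  # -> telephone
```

Stopping at the first group that recognizes a domain is what lets the high-accuracy classifiers short-circuit the cheaper, lower-priority ones.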
  7. The method according to any one of claims 4 to 6, wherein at least one of the N sub-domain classifier groups comprises a first sub-domain classifier and a second sub-domain classifier, and the method further comprises:
    when the first sub-domain classifier performs domain recognition on the text to obtain a first domain recognition result and the second sub-domain classifier performs domain recognition on the text to obtain a second domain recognition result,
    determining at least one of the first domain recognition result and the second domain recognition result as the domain recognition result; or
    determining that both the first domain recognition result and the second domain recognition result are the domain recognition result.
  8. The method according to any one of claims 1 to 5, wherein the performing domain recognition on the text by at least one of the at least two sub-domain classifiers comprises:
    performing named entity recognition (NER) on the text, and determining common features in the recognized content;
    replacing the common features according to a preset rule, the preset rule comprising replacement content corresponding to different categories of common features;
    extracting features from the replaced text, and determining a weight of each feature;
    calculating a score for the text according to the weight of each feature; and
    when the score of the text is greater than a threshold, determining that the text belongs to the domain corresponding to the sub-domain classifier.
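The per-classifier pipeline of claim 8 can be sketched as a linear scoring model: named entities are replaced by category placeholders, features are extracted from the normalized text, and the weighted sum is compared against a threshold. The toy NER step, the weight table, and the threshold below are all illustrative assumptions standing in for trained components.

```python
import re

# Preset rule: replacement content for each category of common feature.
REPLACEMENTS = {"PERSON": "<person>", "SONG": "<song>"}

def toy_ner(text):
    # Stand-in for a real NER step: returns (entity, category) pairs.
    entities = []
    if "Alice" in text:
        entities.append(("Alice", "PERSON"))
    return entities

WEIGHTS = {"call": 2.0, "<person>": 1.5, "please": 0.1}  # hypothetical learned weights
THRESHOLD = 3.0

def classify(text):
    for entity, category in toy_ner(text):                   # NER on the text
        text = text.replace(entity, REPLACEMENTS[category])  # replace by category
    features = re.findall(r"<\w+>|\w+", text.lower())        # feature extraction
    score = sum(WEIGHTS.get(f, 0.0) for f in features)       # weighted score
    return score > THRESHOLD            # does the text belong to this domain?

print(classify("please call Alice"))  # -> True (0.1 + 2.0 + 1.5 = 3.6 > 3.0)
```

Replacing concrete entities with category placeholders is what lets one weight (e.g. for `<person>`) generalize across every name the NER step recognizes.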
  9. An electronic device, characterized in that the electronic device comprises:
    a receiving module, configured to receive a voice instruction;
    a conversion module, configured to convert the voice instruction received by the receiving module into text;
    a first domain recognition module, configured to perform domain recognition on the text converted by the conversion module through at least two sub-domain classifiers to obtain a domain recognition result, the domain recognition result being used to indicate the domain to which the text belongs; and
    a processing module, configured to process the text through a dialog engine corresponding to the domain, determined by the first domain recognition module, to which the text belongs, and to determine the function, corresponding to the text, that the electronic device needs to execute.
  10. The electronic device of claim 9, further comprising:
    a second domain recognition module, configured to match the text with a pre-stored text, and when the text is successfully matched with the pre-stored text, determine that the domain corresponding to the pre-stored text is the domain recognition result of the text.
  11. The electronic device of claim 10, wherein the first domain recognition module is specifically configured to:
    when the second domain recognition module fails to match the text with the pre-stored text, perform domain recognition on the text through the at least two sub-domain classifiers to obtain the domain recognition result.
  12. The electronic device according to any one of claims 9 to 11, wherein the first domain recognition module comprises:
    N sub-domain classifier groups, wherein each group has a different priority and N is a positive integer greater than or equal to 2; at least one of the N sub-domain classifier groups comprises at least two sub-domain classifiers; and each sub-domain classifier is configured to determine whether the text belongs to the domain corresponding to that sub-domain classifier; and
    a control module, configured to:
    control the sub-domain classifiers in the highest-priority group of the N sub-domain classifier groups to perform domain recognition on the text;
    if a sub-domain classifier in the highest-priority group recognizes the domain to which the text belongs, take that domain as the domain recognition result; and
    if no sub-domain classifier in the highest-priority group recognizes the domain to which the text belongs, perform domain recognition on the text by the sub-domain classifiers in the next-priority group of the N sub-domain classifier groups, and so on, until:
    the domain to which the text belongs is recognized, in which case the recognized domain is taken as the domain recognition result; or
    the text has undergone domain recognition by all the sub-domain classifiers in the N sub-domain classifier groups.
  13. The electronic device of claim 12, wherein at least two of the sub-domain classifiers in at least one of the N groups of sub-domain classifiers perform domain recognition on the text in parallel.
  14. The electronic device according to claim 12 or 13, wherein in the N sub-domain classifier groups, the domain identification accuracy of the sub-domain classifiers in the low priority group is lower than the domain identification accuracy of the sub-domain classifiers in the high priority group.
  15. The electronic device of any of claims 12-14, wherein at least one of the N sub-domain classifier groups comprises a first sub-domain classifier and a second sub-domain classifier,
    when the first sub-domain classifier performs domain recognition on the text to obtain a first domain recognition result and the second sub-domain classifier performs domain recognition on the text to obtain a second domain recognition result,
    the control module is further configured to:
    determining at least one of the first domain identification result and the second domain identification result as the domain identification result; or
    And determining that the first domain identification result and the second domain identification result are both the domain identification results.
  16. The electronic device of any one of claims 9 to 13, wherein the sub-domain classifier comprises:
    a named entity recognition (NER) module, configured to perform NER on the text and determine common features in the recognized content;
    a replacement module, configured to replace the common features determined by the NER module according to a preset rule, the preset rule comprising replacement content corresponding to different categories of common features;
    an extraction module, configured to extract features from the replaced text and determine a weight of each feature;
    a calculation module, configured to calculate a score for the text according to the weight of each feature determined by the extraction module; and
    a domain determination module, configured to determine, when the score of the text is greater than a threshold, that the text belongs to the domain corresponding to the sub-domain classifier.
  17. An electronic device, comprising a memory, one or more processors, a plurality of applications, and one or more programs, wherein the one or more programs are stored in the memory, and wherein the one or more processors, when executing the one or more programs, cause the electronic device to implement the method of any one of claims 1 to 8.
  18. A readable storage medium having instructions stored therein that, when executed on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 8.
  19. A computer program product comprising software code for performing the method of any one of claims 1 to 8.
CN201880074893.0A 2018-03-05 2018-03-05 Method for voice recognition of electronic equipment and electronic equipment Active CN111373473B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/078056 WO2019169536A1 (en) 2018-03-05 2018-03-05 Method for performing voice recognition by electronic device, and electronic device

Publications (2)

Publication Number Publication Date
CN111373473A true CN111373473A (en) 2020-07-03
CN111373473B CN111373473B (en) 2023-10-20

Family

ID=67846452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880074893.0A Active CN111373473B (en) 2018-03-05 2018-03-05 Method for voice recognition of electronic equipment and electronic equipment

Country Status (2)

Country Link
CN (1) CN111373473B (en)
WO (1) WO2019169536A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049884B (en) * 2022-01-11 2022-05-13 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer-readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2483805A1 (en) * 2004-10-05 2006-04-05 Inago Corporation System and methods for improving accuracy of speech recognition
US20100094626A1 (en) * 2006-09-27 2010-04-15 Fengqin Li Method and apparatus for locating speech keyword and speech recognition system
WO2013189342A2 (en) * 2013-01-22 2013-12-27 中兴通讯股份有限公司 Information processing method and mobile terminal
US20140214410A1 (en) * 2009-12-11 2014-07-31 Samsung Electronics Co., Ltd Dialogue system and method for responding to multimodal input using calculated situation adaptability
US20150279366A1 (en) * 2014-03-28 2015-10-01 Cubic Robotics, Inc. Voice driven operating system for interfacing with electronic devices: system, method, and architecture
CN106992004A (en) * 2017-03-06 2017-07-28 华为技术有限公司 A kind of method and terminal for adjusting video
CN107731228A (en) * 2017-09-20 2018-02-23 百度在线网络技术(北京)有限公司 The text conversion method and device of English voice messaging

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103187058A (en) * 2011-12-28 2013-07-03 上海博泰悦臻电子设备制造有限公司 Speech conversational system in vehicle
CN103187061A (en) * 2011-12-28 2013-07-03 上海博泰悦臻电子设备制造有限公司 Speech conversational system in vehicle
US9786277B2 (en) * 2015-09-07 2017-10-10 Voicebox Technologies Corporation System and method for eliciting open-ended natural language responses to questions to train natural language processors
CN105389304B (en) * 2015-10-27 2018-11-02 小米科技有限责任公司 Event Distillation method and device
CN105976818B (en) * 2016-04-26 2020-12-25 Tcl科技集团股份有限公司 Instruction recognition processing method and device
CN107741928B (en) * 2017-10-13 2021-01-26 四川长虹电器股份有限公司 Method for correcting error of text after voice recognition based on domain recognition

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111897916A (en) * 2020-07-24 2020-11-06 惠州Tcl移动通信有限公司 Voice instruction recognition method and device, terminal equipment and storage medium
CN111897916B (en) * 2020-07-24 2024-03-19 惠州Tcl移动通信有限公司 Voice instruction recognition method, device, terminal equipment and storage medium

Also Published As

Publication number Publication date
WO2019169536A1 (en) 2019-09-12
CN111373473B (en) 2023-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant