CN111210824A - Voice information processing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN111210824A
CN111210824A
Authority
CN
China
Prior art keywords
structured data
information
current
target
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811390958.0A
Other languages
Chinese (zh)
Other versions
CN111210824B (en)
Inventor
赵云杰
张龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Lutuo Technology Co Ltd
Original Assignee
Shenzhen Lutuo Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Lutuo Technology Co Ltd
Priority to CN201811390958.0A
Publication of CN111210824A
Application granted
Publication of CN111210824B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice information processing method and apparatus, an electronic device, and a storage medium. The method includes: converting acquired voice information into text information; processing the text information, and generating and recording current structured data; judging whether the current structured data is complete; if the current structured data is incomplete, performing state model matching between the current structured data and historical structured data to form complete target structured data, and if the current structured data is complete, taking the current structured data as the target structured data; and executing an action according to the target structured data. The method and apparatus make full use of the acquired voice information and enable a more natural and fluent conversation.

Description

Voice information processing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of the Internet of Things, and more particularly, to a voice information processing method and apparatus, an electronic device, and a storage medium.
Background
A smart-home voice dialogue assistant is a task-oriented voice dialogue assistant for interacting with devices connected to the Internet of Things. Such an assistant mainly relies on human-machine dialogue technology: through speech recognition, the machine understands natural speech and generates a response, providing a more convenient mode of human-machine interaction. At present, most voice dialogue assistants can only understand voice instructions in a fixed format; if a user speaks in the natural language of daily life, the assistant cannot understand the instruction correctly, and the human-machine dialogue cannot proceed smoothly.
Disclosure of Invention
In view of the above problems, the present application provides a voice information processing method, apparatus, electronic device, and storage medium.
In a first aspect, an embodiment of the present application provides a voice information processing method, the method including: converting acquired voice information into text information; processing the text information, and generating and recording current structured data; judging whether the current structured data is complete; if the current structured data is incomplete, performing state model matching between the current structured data and historical structured data to form complete target structured data, and if the current structured data is complete, taking the current structured data as the target structured data; and executing an action according to the target structured data.
In a second aspect, an embodiment of the present application provides a voice information processing apparatus, including: a conversion module for converting acquired voice information into text information; a preprocessing module for processing the text information, and generating and recording current structured data; a judging module for judging whether the current structured data is complete; a processing module for performing state model matching between incomplete current structured data and historical structured data to form complete target structured data, or taking complete current structured data as the target structured data; and a response module for executing actions according to the target structured data.
In a third aspect, an embodiment of the present application provides an electronic device including one or more processors, a memory, and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method described above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having program code stored therein, wherein the program code performs the above method when run.
Compared with the prior art, in the voice information processing method and apparatus, electronic device, and storage medium provided herein, the acquired voice information is processed, each piece of structured data is generated and recorded for use in the next round, and whether the generated structured data is complete is judged. If it is incomplete, state model matching is performed on the current structured data to form complete target structured data, and the corresponding action is executed according to the information in the target structured data; if it is complete, the corresponding action is executed directly according to the current structured data. Through this processing of the voice information, the information of every dialogue round of the user is fully retained and utilized, and the user does not need to provide complete information in every round, so the dialogue language is more natural and fluent, the interaction time can be shortened, and the interaction experience is improved.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 shows a schematic diagram of an application environment suitable for the embodiment of the present application.
Fig. 2 shows a flowchart of a voice information processing method according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating a method for processing voice information according to another embodiment of the present application.
Fig. 4 shows a diagram of the results of intent classification in the embodiment shown in fig. 3.
Fig. 5 is a flowchart illustrating a method for processing voice information according to another embodiment of the present application.
Fig. 6 is a flowchart illustrating a voice information processing method according to still another embodiment of the present application.
Fig. 7 shows a block diagram of a speech information processing apparatus according to an embodiment of the present application.
Fig. 8 shows a block diagram of the structure of the response module in the embodiment shown in fig. 7.
Fig. 9 shows a block diagram of an electronic device for executing a voice information processing method according to an embodiment of the present application.
Fig. 10 illustrates a storage unit provided in an embodiment of the present application and used for storing or carrying program codes for implementing a voice information processing method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The Internet of Things is a vast network formed by combining various information sensors with the internet, collecting in real time the information of any object or process that needs to be monitored, connected, or interacted with. Common everyday applications of the Internet of Things include the logistics industry, intelligent security, and the smart home.
The smart-home dialogue assistant is a task-oriented voice dialogue assistant for interacting with devices connected to the Internet of Things and is currently applied mainly to living scenarios: a user can interact with Internet of Things devices by voice, control a device to execute a specified function, query a specified state of a device, and so on. The voice dialogue assistant can issue read/write instructions to a device through the Internet of Things cloud platform, so that the user can interact with the device. A voice assistant is a combination of speech recognition technology and natural language processing technology. With the maturing of speech recognition technology and the development of cloud computing, the Internet of Things, and intelligent hardware, the application demand for voice dialogue assistants is increasingly broad.
The inventor found in research that current smart-home voice dialogue assistants support only single-round dialogue: the rounds of a voice dialogue bear no relation to one another, and each voice instruction given by the user must completely include the interaction target and the interaction action. For example, a user who wants to control an air conditioner might say: "turn on the air conditioner", "turn the air conditioner to 25 degrees", "how many degrees is the air conditioner now?", "help me set the air conditioner to high wind speed". In every round the user needs to provide complete information, that is, both the interaction target "air conditioner" and an interaction action such as "turn on", "adjust", or "query". In natural language, since the four utterances are related to one another, the user would normally omit the interaction target "air conditioner" in the latter three. Because current smart-home voice dialogue assistants only support single-round dialogue, the user may need to provide redundant information in every round, the language of the interaction commands between user and device is unnatural, the interaction time is lengthened, the user's habits are not accommodated, and the user experience is poor.
Therefore, the inventor proposes the voice information processing method of the present application, which makes the dialogue more natural and fluent: through the processing of voice information, the information of every dialogue round of the user is fully retained and utilized, and the user does not need to provide complete information in every round when conversing with the smart-home dialogue assistant, so the dialogue language is more natural and fluent, the interaction time can be shortened, and the interaction experience improved.
For the convenience of describing the scheme of the present application in detail, the following description will first describe an application environment of the embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 1, an application environment 10 of a device control method according to an embodiment of the present application is shown. The application environment 10 includes a server 96, a gateway 97, other devices 98, a target device 99, and an electronic device 100. The gateway 97 is connected to the server 96 for information transmission, and the gateway 97 has a physical interface and data format compatible with the target device 99 and the other devices 98, so that the target device 99 can be controlled through the gateway 97. The target device 99 and the other devices 98 may be televisions, air conditioners, smart curtains, lights, projectors, and the like. The target device 99 is the device the user wants to control. The electronic device 100 may be a mobile phone, a tablet computer, a personal computer (PC), a notebook computer, an intelligent wearable device, an intelligent television, a vehicle-mounted terminal, or another terminal device. In this embodiment, the electronic device 100 is installed with a client for controlling the other devices 98 or the target device 99, and the electronic device 100 also carries a voice assistant through which the user can input control commands. The gateway 97 may be connected to the server 96 through a router, and the electronic device 100 may be connected to the server 96 and the gateway 97 through a network. The server 96 may be a local server or a cloud server.
The following will describe embodiments of the present application in detail.
Referring to fig. 2, an embodiment of the present application provides a voice information processing method, where the execution subject of the processing flow described in this embodiment may be an electronic device, a gateway, a local server, or a cloud server. The method may include:
step S110, converting the acquired voice information into text information.
In one embodiment, a user can input voice information through the voice assistant in a client on the electronic device, and the electronic device processes the voice information directly.
In another embodiment, the user can send the voice information through the voice assistant in the client on the electronic device to the gateway or the server, and the gateway or the server processes the voice information.
In other embodiments, if the gateway also has a voice input interface or an audio acquisition module (the voice input interface including an analog or digital audio input interface, and the audio acquisition module including a microphone or the like), the user can also input voice information directly through the gateway, and the gateway can either process the voice information itself or send it to the server for processing.
In the system shown in fig. 1, when the gateway or the server acquires the voice information, the acquired voice information may be segmented into a plurality of segments and then converted into text information using speech recognition technology.
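By way of illustration only, this conversion step might be sketched in Python with the open-source SpeechRecognition package; the embodiment does not prescribe a particular recognition engine, so the package choice and the language parameter here are assumptions:

import speech_recognition as sr

def voice_to_text(wav_path):
    # Recognizer and AudioFile come from the SpeechRecognition package.
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole voice segment
    # recognize_google sends the audio to a cloud recognizer and returns text.
    return recognizer.recognize_google(audio, language="zh-CN")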
Step S120, processing the text information, and generating and recording current structured data.
The text information may be preprocessed before being processed into structured data. Preprocessing may include removing stop words, removing invalid characters, normalizing numbers, and correcting homophone errors. Stop words are words in the text information that carry no actual meaning, such as mood particles. Removing invalid characters means removing characters in the text information that some programs cannot recognize. Number normalization unifies the numeric expressions in the text information into a single form, for example converting the written-out numbers one and two into the Arabic digits 1 and 2. Homophone error correction corrects words in the text information that share a pronunciation but differ in written form.
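A minimal Python sketch of this preprocessing, with an illustrative stop-word list and number map (a real deployment would use domain-specific resources), might be:

import re

STOP_WORDS = {"啊", "吧", "呢"}              # mood particles carrying no actual meaning
NUM_MAP = {"一": "1", "二": "2", "三": "3"}  # written-out numbers to Arabic digits

def preprocess(text):
    # Remove characters that downstream programs cannot recognize.
    text = re.sub(r"[^\w\s]", "", text)
    # Normalize the numeric expressions to a single form.
    for zh, ar in NUM_MAP.items():
        text = text.replace(zh, ar)
    # Remove stop words; homophone correction would additionally need a
    # pronunciation dictionary and is omitted from this sketch.
    for w in STOP_WORDS:
        text = text.replace(w, "")
    return text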
In some embodiments, the structured data may be generated from the text information converted from the voice information by word segmentation, part-of-speech tagging, named entity recognition, and template matching. Word segmentation splits the text information into the smallest word units of the sentence according to a dictionary. Part-of-speech tagging labels the part of speech of each word in the segmented text information; named entity recognition identifies the entities with specific meaning in the sentence; and template matching fills the combined results of part-of-speech tagging and named entity recognition into a preset template, thereby generating the structured data.
In other embodiments, the text information may also be processed using model matching, syntactic parsing, or search generalization to generate structured data.
After each piece of text information is processed to generate current structured data, all the data needs to be recorded and stored for use as historical structured data in subsequent rounds of the voice dialogue.
Step S130, determining whether the current structured data is complete.
After the current structured data is generated, it is analyzed to determine whether it is executable, complete structured data, for example structured data on which the cloud can perform a read/write operation.
Whether the current structured data is complete is judged by checking whether the structured data contains the necessary information; if it does, the current structured data is judged to be complete.
In some embodiments, the necessary information may include valid device information and valid action information. The device information refers to the abstract data of the matched intelligent device on the cloud platform, such as the device ID, device name, device attribute, and device data value. The action information refers to the abstract data of the user's interaction action, such as the action name, action category, and action value. Valid device information means that the fields describing the device information are not all empty, and valid action information means that the fields describing the action information are not all empty. When the current structured data contains both valid device information and valid action information, it is judged to be complete. If it contains only valid device information or only valid action information, it is judged to be incomplete.
For example, the structured data for the text information "turn off" is:
{Intent: 'control',
Position: null,
Object: {name: 'null', id: 'null', type: 'null', attribute: 'null'},
Action: {name: 'off', type: 'off', value: null}}
where Object represents the device information and Action represents the action information. The structured data contains the action information "off", but the device information values are all null; it thus contains only valid action information and no valid device information, so it can be determined to be incomplete structured data.
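A minimal Python sketch of the completeness test of step S130, treating a field group as valid when its fields are not all empty (the dictionary layout follows the structured data shown above), might be:

def is_valid(section):
    # A field group (Object or Action) is valid when its fields are not all empty.
    return any(v not in (None, "null", "") for v in (section or {}).values())

def is_complete(structured):
    # Complete structured data must contain both valid device information
    # (Object) and valid action information (Action).
    return is_valid(structured.get("Object")) and is_valid(structured.get("Action"))

# The "turn off" utterance above carries valid action information only:
current = {"Intent": "control", "Position": None,
           "Object": {"name": "null", "id": "null", "type": "null", "attribute": "null"},
           "Action": {"name": "off", "type": "off", "value": None}}
assert not is_complete(current)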
Step S140, if the current structured data is incomplete, performing state model matching on the current structured data and historical structured data to form complete target structured data, and if the current structured data is complete, taking the current structured data as the target structured data.
If the current structured data is not complete structured data, it can be matched against the historical structured data, for example by state model matching; if a piece of historical structured data is related to the current structured data, the two are merged into complete target structured data.
The historical structured data is the structured data generated and recorded in each previous round; any structured data generated before the current structured data can serve as historical structured data. A state model is a set of predefined states in a state machine.
When the current structured data is judged to be incomplete, it may lack valid device information, valid action information, or both, and cannot be used as the target structured data. The current structured data is matched against the historical structured data using the state model; when the condition of a preset state mode in the state model is satisfied, that preset state mode is considered matched and complete structured data is formed, which is then taken as the target structured data.
If the current structured data is judged to be complete, it contains both valid device information and valid action information and is used directly as the target structured data; that is, the target structured data is always complete structured data.
Step S150, executing an action according to the target structured data.
If the target structured data includes the necessary information, that is, valid device information and valid action information, the action can be executed according to the valid device information and valid action information in the target structured data.
In the voice information processing method of this embodiment, the acquired voice information is processed, each piece of structured data is generated and recorded for use in the next round, and whether the generated structured data is complete is judged. If it is incomplete, state model matching is performed on the current structured data to form complete target structured data, and the corresponding action is executed according to the information in the target structured data; if it is complete, the corresponding action is executed directly according to the information in the current structured data. Through this processing of the voice information, the information of every dialogue round of the user is fully retained and utilized, and the user does not need to provide complete information in every round, so the dialogue language is more natural and fluent, the interaction time can be shortened, and the interaction experience is improved.
Referring to fig. 3, another embodiment of the present application provides a voice information processing method. Building on the previous embodiment, this embodiment further describes the process of generating structured data through intent classification and template matching. The method may include:
step S210, converting the acquired voice information into text information.
Specifically, reference may be made to the related description of step S110 in the previous embodiment, which is not repeated herein.
Step S220, performing intent classification on the text information, and determining whether the text information is an interactive control command.
The input of intent classification is the text information, and the output is an intent label for that text information. The intent is the category of task the text information expresses. In the scenario depicted in fig. 1, the intent category may be query, control, scene execution, or the like. The intent classifier can be implemented in a variety of ways, for example as a recurrent neural network model or a statistical model.
Referring to fig. 4, fig. 4 illustrates several results of intent classification of text information. Intent classification may first judge whether the text information belongs to the smart-home interaction category; if it does, it is classified in more detail into the query, control, scene, timing, configuration, and other classes of smart-home interaction.
When the intent is classified as the scene class, position information may also be included; when the intent is classified as the timing class, time information and the like may also be included. The structured data that needs to be generated for the text differs depending on the result of the intent classification.
After intent classification is completed, an intent label is output, and the intent of the text information can be judged from that label. If the intent classification label of the text information is the control class, the intent of the text information can be judged to be interactive control, and the text information is an interactive control command.
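The embodiment leaves the classifier open (a recurrent neural network or a statistical model); purely as an illustrative stand-in, a keyword-rule classifier in Python could look like this, with the keyword lists being assumptions:

INTENT_KEYWORDS = {
    "query":   ("how many", "what is", "status"),
    "control": ("turn", "switch", "set", "adjust"),
    "scene":   ("scene", "mode"),
    "timing":  ("o'clock", "tomorrow", "minutes later"),
}

def classify_intent(text):
    lowered = text.lower()
    for label, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return label          # output: the intent label of the text
    return "other"

assert classify_intent("turn the air conditioner to 25 degrees") == "control"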
Step S230, if the command is an interactive control command, performing word segmentation on the text information, and performing part-of-speech tagging and named entity recognition on the text information after word segmentation.
An interactive control command typically includes an intent, valid device information, and valid action information. When the intent of the text information is judged to be interactive control, word segmentation is performed on the text information. The text information is a sentence, and words are the smallest units of the sentence; segmentation may follow the modern Chinese word-segmentation standard for information processing, dividing the sentence into its smallest word units.
For example, the text information is: "turn the air conditioner in the bedroom to 25 degrees at 12 o'clock today". Assuming the current date is June 8, 2018, the text information can be segmented according to the modern word-segmentation standard for information processing into: today / 12 o'clock / (particle) / bedroom / (particle) / air conditioner / turn to / 25 / degrees. Part-of-speech tagging is then performed on the segmented information: today (time) / 12 o'clock (time) / bedroom (slot_position) / air conditioner (slot_device) / turn to / 25 (number) / degrees. Named entity recognition resolves "today 12 o'clock" to 2018-06-08 12:00.
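For segmentation and part-of-speech tagging, a sketch using the open-source jieba tokenizer (an illustrative tool choice; the embodiment only requires segmentation that follows the modern Chinese word-segmentation standard) might be:

import jieba.posseg as pseg

sentence = "今天12点把卧室的空调调到25度"   # the example sentence above
for pair in pseg.cut(sentence):
    # pair.word is the segmented word and pair.flag its part-of-speech tag,
    # e.g. 空调 n (noun), 25 m (numeral). Mapping words to slot_position /
    # slot_device tags would use the user's device naming information.
    print(pair.word, pair.flag)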
Step S240, generating the current structured data under the intent by mapping and analyzing the results of part-of-speech tagging and named entity recognition based on template matching.
In one embodiment, one piece of structured data is generated per interaction action, and multiple different interaction actions generate multiple pieces of structured data. For example, in "turn on the light and turn on the air conditioner", both clauses share the same interaction action "turn on", so one piece of structured data is generated: the action information is "turn on" and the device information is the light and the air conditioner. In "turn on the light and turn the air conditioner to 25 degrees" there are two interaction actions, one "turn on" and one "adjust", so two pieces of structured data are generated: one whose action information is "turn on" with the light as device information, and another whose action information is "turn to 25 degrees" with the air conditioner as device information.
After part-of-speech tagging and named entity recognition, the tagging result and the recognition result are obtained; these are combined with a preset smart-home vocabulary template and the user device naming information acquired from the cloud to jointly generate the structured data.
The vocabulary template is a template describing operation attributes and the corresponding cloud devices. For example, if a cloud device is an air conditioner (device_ac) and the air conditioner has the attribute operation of turning on (attribute_on), the vocabulary template may take the form of the triple device_ac, attribute_on: [turn on, start, ...]; the vocabulary template is a collection of structures of this type. The structured data is generated by combining the part-of-speech tagging result and the named entity recognition result according to the vocabulary template. The structured data of the sentence "turn the air conditioner in the bedroom to 25 degrees at 12 o'clock today" is:
{Intent: 'control',
Position: {name: 'bedroom', id: 'position.001', type: 'room'},
Object: {name: 'air conditioner', id: 'object.001', type: 'AC', attribute: 'ac_state'},
Action: {name: 'turn to', type: 'set', value: '25'},
Time: '2018-06-08-12-00-00-00'}
where Intent represents the intent of the text information, Position represents the position information, Object represents the device information, Action represents the action information, and Time represents the time information.
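A minimal Python sketch of this template-matching step, keeping the vocabulary template as (device, attribute, trigger words) triples as described above (the entries and tag names are illustrative assumptions), might be:

VOCAB_TEMPLATES = [
    ("device_ac", "attribute_on",  ("turn on", "start")),
    ("device_ac", "attribute_set", ("turn to", "set to")),
]

def match_template(tagged_tokens):
    # tagged_tokens: (word, tag) pairs from part-of-speech tagging and NER.
    data = {"Intent": "control", "Position": None,
            "Object": {"name": "null", "attribute": "null"},
            "Action": {"name": "null", "type": "null", "value": None}}
    words = [w for w, _ in tagged_tokens]
    for word, tag in tagged_tokens:
        if tag == "slot_device":
            data["Object"]["name"] = word
        elif tag == "slot_position":
            data["Position"] = {"name": word, "type": "room"}
        elif tag == "number":
            data["Action"]["value"] = word
    # Fill the operation attribute and action name from the vocabulary template.
    for _device, attribute, triggers in VOCAB_TEMPLATES:
        hit = next((t for t in triggers if t in words), None)
        if hit:
            data["Object"]["attribute"] = attribute
            data["Action"]["name"] = hit
    return data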
Step S250, determining whether the current structured data is complete.
Step S260, if the current structured data is incomplete, performing state model matching on the current structured data and historical structured data to form complete target structured data, and if the current structured data is complete, taking the current structured data as the target structured data.
Step S270, executing an action according to the target structured data.
The steps S250 to S270 refer to corresponding parts of the foregoing embodiments, and are not described herein again.
In the voice information processing method of this embodiment, the acquired voice information is converted into text information, the text information is classified by intent, and word segmentation, part-of-speech tagging, and named entity recognition are performed on it according to the classification result; the tagging and recognition results are then combined with the preset smart-home vocabulary template and the user device naming information acquired from the cloud to generate the structured data under that intent. Each acquired sentence of voice information is converted into text, processed into structured data, and stored, providing the basis for subsequent calls to the structured data.
Referring to fig. 5, another embodiment of the present application provides a method for processing voice information, where the embodiment focuses on a process of processing incomplete structured data to form complete structured data, and the method may include:
step S310, converting the acquired voice information into text information.
Step S320, processing the text information, and generating and recording current structured data.
Step S330, determining whether the current structured data is complete.
Steps S310 to S330 may refer to corresponding portions of the foregoing embodiments, and are not described herein again.
Step S340, if not, traversing the historical structured data in reverse order, and taking the first complete structured data found as the context to be matched.
After the text information is processed, the current structured data can be generated and recorded; each time a piece of structured data is generated, it is recorded. The historical structured data is all the structured data recorded before the current structured data was generated. When the current structured data is judged to be incomplete, the historical structured data is traversed in reverse order to search for complete structured data, and the first complete historical structured data found is taken as the context to be matched for use in state model matching.
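A short sketch of this reverse traversal, reusing the is_complete test sketched for step S130, might be:

def find_context_to_match(history):
    # Traverse the recorded historical structured data from newest to oldest
    # and return the first complete entry as the context to be matched.
    for past in reversed(history):
        if is_complete(past):
            return past
    return None   # no complete historical structured data exists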
Step S350, taking the current structured data as the current context.
If the current structured data is judged to be incomplete structured data, it is taken as the current context for use when matching the state model.
Step S360, cyclically matching the state model, and judging whether the context to be matched and the current context conform to a preset state mode.
With the current structured data that was judged incomplete as the current context, whether the state conditions of a preset state mode are satisfied is judged from the context to be matched and the current context, thereby determining whether that preset state mode is matched.
In the system shown in fig. 1, a state machine may be defined that presets a plurality of state modes, each with its own conditions. When the context to be matched and the current context satisfy the conditions of a preset state mode, that preset state mode is determined to be matched.
For example, a state machine is defined with three different preset state modes, each with different conditions; see Table 1.
TABLE 1
State 1 (same target device omitted): the current context contains an interactive action but lacks a target device, and the target device in the context to be matched can execute the interactive action in the current context.
State 2 (target device inferred from position): the current context contains an interactive action and position information but lacks a target device, and a target device in the context to be matched with similar position information can execute the interactive action in the current context.
State 3 (interactive action omitted): the current context contains a target device but lacks an interactive action, and that target device can execute the interactive action in the context to be matched.
The state model is defined by a plurality of different states, that is, the preset state modes. The context to be matched is the first complete structured data found by traversing the historical structured data in reverse order; the current context is the current structured data judged to be incomplete. The conditions are the requirements the context to be matched and the current context must meet for a preset state mode; when all conditions of a mode are met, that preset state mode is considered matched.
For example, the context to be matched is "turn on the air conditioner" and the current context is "turn off". Following the method described in the above embodiment, the current structured data is:
{Intent: 'control',
Position: null,
Object: {name: 'null', id: 'null', type: 'null', attribute: 'null'},
Action: {name: 'off', type: 'off', value: null}}
The Object values in the current context are all null, so there is no valid device information in the current context, that is, the target device is absent. The Action name in the current context is "off", so there is valid action information in the current context, that is, the interactive action "turn off" is present.
The structured data of the context to be matched is:
{Intent: 'control',
Position: null,
Object: {name: 'air conditioner', id: 'object.001', type: 'AC', attribute: 'ac_state'},
Action: {name: 'open', type: 'on', value: null}}
The Object name in the context to be matched is "air conditioner", so valid device information exists in the context to be matched, that is, the target device is present. The Action name in the context to be matched is "open", so valid action information exists, that is, the interactive action "turn on" is present; and the "air conditioner" can execute the action "turn off".
It can thus be concluded that the current context lacks a target device but contains the interactive action "turn off", and the target device "air conditioner" in the context to be matched can execute that interactive action. The conditions of preset state mode state 1 are therefore satisfied, and state 1 is matched: the same target device is omitted. If the conditions of state 1 were not met, the other preset state modes, such as state 2 and state 3, would be compared in turn.
Step S370, if a preset state mode is matched, merging the context to be matched with the current context to form complete target structured data.
When the context to be matched and the current context conform to a preset state mode, they are merged to form complete target structured data. For example, the context to be matched is "turn on the air conditioner" and the current context is "turn off"; after cyclic matching against the preset state modes, state 1 is matched, and the context to be matched, "turn on the air conditioner", is merged with the current context, "turn off". The complete target structured data formed is:
{Intent: 'control',
Position: null,
Object: {name: 'air conditioner', id: 'object.001', type: 'AC', attribute: 'ac_state'},
Action: {name: 'off', type: 'off', value: null}}
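A Python sketch of the cyclic matching of steps S360 to S370 and the merge above is given below; the three branches encode the preset state modes of Table 1, and the device-capability check is reduced to a placeholder (a real system would query the cloud platform):

def is_valid(section):
    return any(v not in (None, "null", "") for v in (section or {}).values())

def can_execute(device, action):
    # Placeholder: a real system would ask the cloud platform whether
    # this device type supports the requested action.
    return is_valid(device) and is_valid(action)

def match_state_modes(to_match, current):
    dev_missing = not is_valid(current["Object"])
    act_missing = not is_valid(current["Action"])
    # State 1: the same target device is omitted in the current context.
    if dev_missing and not act_missing and can_execute(to_match["Object"], current["Action"]):
        return {**current, "Object": to_match["Object"]}
    # State 2: the target device is inferred from matching position information.
    if dev_missing and current.get("Position") and \
            current["Position"] == to_match.get("Position") and \
            can_execute(to_match["Object"], current["Action"]):
        return {**current, "Object": to_match["Object"]}
    # State 3: the interactive action is omitted in the current context.
    if act_missing and not dev_missing and can_execute(current["Object"], to_match["Action"]):
        return {**current, "Action": to_match["Action"]}
    return None   # no preset state mode is matched

# With to_match being "turn on the air conditioner" and current being
# "turn off", state 1 matches and the merged result is the target
# structured data shown above.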
Step S380, executing an action according to the target structured data.
Through state model matching, incomplete current structured data is converted into complete, executable target structured data; according to the valid device information in the target structured data, the device corresponding to that information can be controlled to execute the action corresponding to the valid action information.
In this information processing method, incomplete structured data is combined with historical structured data through state model matching to generate complete target structured data, so every piece of acquired information is fully used, the understanding of speech is enhanced, and the dialogue stays natural and fluent.
Referring to fig. 6, a further embodiment of the present application provides a method for processing voice information, where the embodiment focuses on the process of performing actions according to the formed complete target structured data, and the method may include:
step S410, converting the acquired voice information into text information.
Step S420, processing the text information, and generating and recording current structured data.
Step S430, determine whether the current structured data is complete.
Step S440, if the current structured data is incomplete, performing state model matching on the current structured data and historical structured data to form complete target structured data, and if the current structured data is complete, taking the current structured data as the target structured data.
The steps S410 to S440 can refer to the corresponding parts of the previous embodiments, and are not described herein again.
Step S450, controlling the device to execute an action according to the target structured data.
According to the valid device information and valid action information in the target structured data, the device corresponding to the valid device information is controlled to execute the action corresponding to the valid action information.
Step S460, receiving the action execution result returned by the device.
After the device is controlled to execute the action according to the valid device information and valid action information in the target structured data, the action execution result returned by the device is received.
When the execution subject is the electronic device or the gateway, the cloud platform can be requested according to the valid device information and valid action information in the target structured data: the cloud platform interface is called, and the cloud platform is instructed to control the device corresponding to the valid device information to execute the action corresponding to the valid action information. For example, if the valid device information of the target structured data is the air conditioner and the valid action information is turning on, "air conditioner" and "turn on" can be sent through the cloud platform to the gateway connected to the air conditioner; the gateway sends an action instruction to the air conditioner to turn it on, the air conditioner feeds back the action execution result to the gateway after it is successfully turned on, and the gateway feeds the result back through the cloud platform. When the execution subject is the server, the air conditioner can be controlled to turn on directly according to the valid device information "air conditioner" and the valid action information "turn on" in the target structured data, and the action execution result returned by the air conditioner is received directly.
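By way of illustration, issuing the control command through a cloud platform interface might be sketched as follows; the endpoint URL and payload fields are hypothetical, since the embodiment only states that a cloud platform interface is called:

import requests

CONTROL_URL = "https://cloud.example.com/api/v1/device/control"   # hypothetical endpoint

def execute_action(target):
    payload = {
        "device_id": target["Object"]["id"],     # valid device information
        "action":    target["Action"]["type"],   # valid action information
        "value":     target["Action"].get("value"),
    }
    resp = requests.post(CONTROL_URL, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()   # the action execution result fed back by the device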
Step S470, matching a corresponding preset reply template according to the target structured data and the action execution result, and generating a reply text.
After the action execution result is received, a reply text can be generated by matching the target structured data and the action execution result against a corresponding reply template. For example, the structured data is
{Intent: 'control',
Position: null,
Object: {name: 'air conditioner', id: 'object.001', type: 'AC', attribute: 'ac_state'},
Action: {name: 'open', type: 'on', value: null}}
The action execution result returned by the air conditioner is success. The preset reply template for the attribute ac_state may be: "OK, [object_name] [action_name]". Filling the corresponding object_name and action_name into the reply template generates the reply text "OK, the air conditioner is turned on".
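A minimal sketch of this reply generation, keying a preset template on the device attribute and the execution result (the templates and keys are illustrative), might be:

REPLY_TEMPLATES = {
    ("ac_state", "success"): "OK, the {object_name} is now {action_name}.",
    ("ac_state", "failure"): "Sorry, the {object_name} could not be turned {action_name}.",
}

def generate_reply(target, result):
    # Match the preset reply template by device attribute and result, then
    # fill the slots from the target structured data.
    template = REPLY_TEMPLATES.get((target["Object"]["attribute"], result), "Done.")
    return template.format(object_name=target["Object"]["name"],
                           action_name=target["Action"]["type"])

# generate_reply(target, "success") yields "OK, the air conditioner is now on."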
After the reply text is generated, it may be converted into corresponding voice information using speech synthesis technology, and the voice information is output through the interface the user used to input voice information. For example, if the user sent the voice information through the voice assistant in a client on the electronic device, the voice generated from the reply text is output by the voice assistant.
In this voice information processing method, the voice information is converted into text information; structured data is generated by processing the text information; whether the structured data is complete is judged; if it is incomplete, state model matching is performed, and if a preset state mode is matched, the context to be matched is merged with the current context to generate complete target structured data; the device is controlled to execute the target action according to the target structured data; the execution result of the action is received and combined with the target structured data to match a corresponding reply template and generate a reply text; and the reply text is converted into voice output. Even when the user omits part of the information in colloquial voice input, the method can still understand what the user wants to express and respond to the user's instruction, so that the user's language in the dialogue can be as natural and fluent as in human conversation, improving the experience of controlling devices by voice.
Referring to fig. 7, a voice information processing apparatus 500 according to an embodiment of the present application is shown. The apparatus 500 includes a conversion module 510, a preprocessing module 520, a judging module 530, a processing module 540, and a response module 550.
A conversion module 510, configured to convert the acquired voice information into text information;
the preprocessing module 520 is configured to process the text information, generate and record current structured data;
a judging module 530, configured to judge whether the current structured data is complete;
the processing module 540 is configured to perform state model matching on the incomplete current structured data and the historical structured data to form complete target structured data, and use the complete current structured data as the target structured data;
a response module 550, configured to execute actions according to the target structured data.
Referring to fig. 8, a block diagram of the response module 550 according to an embodiment of the present application is shown.
Further, the response module 550 includes an execution unit 551, a receiving unit 552, and a reply unit 553. The execution unit 551 is configured to control the device to execute actions according to the target structured data; the receiving unit 552 is configured to receive the action execution result returned by the device; and the reply unit 553 is configured to match a corresponding preset reply template according to the target structured data and the action execution result, and generate a reply text.
Further, the reply unit 553 is further configured to match a corresponding preset reply template according to the target structured data and the action execution result; and filling effective equipment information and effective action information in the target structured data into the preset reply template to generate a reply text.
Further, the conversion module 510 is further configured to convert the reply text into speech.
Further, the preprocessing module 520 is further configured to perform intent classification on the text information, and perform word segmentation on the text information after intent classification; performing part-of-speech tagging and named entity recognition on the text information after word segmentation; and mapping and analyzing the results of the part of speech tagging and the named entity recognition based on template matching to generate the structural data under the intention.
Further, the judging module 530 is further configured to judge whether the intent of the text information is clear; if it is not, the processing module 540 prompts the user to express a clear intent.
Further, the judging module 530 is further configured to judge whether the text information with a clear intent is an interactive control command; if it is not, a chat robot interface is invoked to generate a reply directly.
Further, the judging module 530 is further configured to judge whether the context to be matched and the current context satisfy the state condition of a preset state mode: if the current context has an interactive action but lacks a target device, and the target device in the context to be matched can execute the interactive action in the current context, the state condition is judged to be met; or if the current context has an interactive action and position information but lacks a target device, and a target device with similar position information in the context to be matched can execute the interactive action in the current context, the state condition is judged to be met; or if the current context has a target device but lacks an interactive action, and that target device can execute the interactive action in the context to be matched, the state condition is judged to be met.
Further, the processing module 540 traverses the historical structured data in reverse order and takes the first complete structured data found as the context to be matched; takes the current structured data as the current context; and cyclically matches the state model, with the judging module 530 judging whether the context to be matched and the current context conform to a preset state mode; if they do, the processing module 540 merges the context to be matched with the current context to form complete target structured data.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
To sum up, in the voice information processing method provided by the application, the acquired voice information is processed, each piece of structured data is generated and recorded for use in the next round, and whether the generated structured data is complete is judged; if it is incomplete, state model matching is performed on the current structured data to form complete target structured data, and the corresponding action is executed according to the information in the target structured data; if it is complete, the corresponding action is executed directly according to the current structured data. Through this processing of the voice information, the information of every dialogue round of the user is fully retained and utilized; the user need not provide complete information in every round, so the dialogue language is more natural and fluent, the interaction time can be shortened, and the interaction experience is improved.
In the several embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be in an electrical, mechanical or other form.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The embodiment of the application provides a structural block diagram of an electronic device. Referring to fig. 9, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 100 may be a smart phone, a tablet computer, an electronic book, or other electronic devices capable of running an application. The electronic device 100 in the present application may include one or more of the following components: a processor 101, a memory 102, and one or more applications, wherein the one or more applications may be stored in the memory 102 and configured to be executed by the one or more processors 101, the one or more programs configured to perform the methods as described in the aforementioned method embodiments.
Processor 101 may include one or more processing cores. The processor 101 connects various parts within the overall electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 102 and calling data stored in the memory 102. Alternatively, the processor 101 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 101 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, applications, and the like; the GPU renders and draws display content; and the modem handles wireless communication. It is understood that the modem may not be integrated into the processor 101 but may be implemented by a separate communication chip.
The memory 102 may include Random Access Memory (RAM) or Read-Only Memory (ROM). The memory 102 may be used to store instructions, programs, code sets, or instruction sets. The memory 102 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the method embodiments described above, and the like. The data storage area may also store data created by the electronic device 100 during use (e.g., phone book, audio and video data, chat log data), and the like.
Referring to fig. 10, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 600 has stored therein program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 600 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Alternatively, the computer-readable storage medium 600 includes a non-transitory computer-readable storage medium. The computer readable storage medium 600 has storage space for program code 610 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. The program code 610 may be compressed, for example, in a suitable form.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (17)

1. A method for processing speech information, the method comprising:
converting the acquired voice information into text information;
processing the text information, and generating and recording current structured data;
judging whether the current structured data is complete;
if the current structured data is incomplete, performing state model matching on the current structured data and historical structured data to form complete target structured data, and if the current structured data is complete, taking the current structured data as the target structured data;
and executing actions according to the target structured data.
2. The method of claim 1, wherein said determining whether the current structured data is complete comprises:
detecting whether the current structured data contains valid device information and valid action information;
and if the current structured data contains the valid device information and the valid action information, judging that the current structured data is complete.
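A sketch of this completeness test, assuming (hypothetically) that the structured data is a dict with "device" and "action" slots:

```python
# Structured data counts as complete when it carries both valid device
# information and valid action information (slot names are hypothetical).
def is_complete(data: dict) -> bool:
    return bool(data.get("device")) and bool(data.get("action"))
```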
3. The method of claim 1 or 2, wherein performing state model matching on the current structured data and historical structured data to form complete target structured data if the current structured data is incomplete comprises:
traversing the historical structured data in reverse order, and taking the first complete structured data found as the context to be matched;
taking the current structured data as the current context;
cyclically performing state model matching, and judging whether the context to be matched and the current context conform to a preset state pattern;
and if so, matching the context to be matched with the current context to form complete target structured data.
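One way this reverse traversal might look in Python; fits_state_pattern and merge_contexts are hypothetical stand-ins for the steps of claims 4 and 6:

```python
# Sketch of claim 3: scan history newest-first, take the first complete
# record as the context to be matched, then test the preset state pattern.
def match_with_history(current: dict, history: list) -> dict:
    for past in reversed(history[:-1]):      # skip the just-recorded current
        if is_complete(past):
            to_match = past                  # context to be matched
            if fits_state_pattern(to_match, current):
                return merge_contexts(to_match, current)
            break                            # only the first complete record is used
    return current                           # nothing usable; return unchanged
```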
4. The method of claim 3, wherein cyclically performing state model matching and judging whether the context to be matched and the current context conform to the preset state pattern comprises:
judging whether the context to be matched and the current context satisfy a state condition of the preset state pattern;
and if the state condition is satisfied, judging that the context to be matched and the current context conform to the preset state pattern.
5. The method of claim 4, wherein determining whether the context to be matched and the current context satisfy the state condition of the preset state pattern comprises:
if the current context has an interaction action but lacks a target device, and the target device of the context to be matched can execute the interaction action of the current context, judging that the state condition is satisfied; or
if the current context has an interaction action and position information but lacks a target device, and a target device of the context to be matched with similar position information can execute the interaction action of the current context, judging that the state condition is satisfied; or
if the current context has a target device but lacks an interaction action, and the target device of the current context can execute the interaction action of the context to be matched, judging that the state condition is satisfied.
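A sketch of these three disjunctive conditions; can_execute and similar_position are hypothetical predicates, and the slot names are invented for illustration:

```python
# The three state conditions of claim 5, tested in claim order.
def satisfies_state_condition(to_match: dict, current: dict) -> bool:
    action = current.get("action")
    device = current.get("device")
    past_device = to_match.get("device")

    # 1) current has an action but no device; the historical device can run it
    if action and not device and can_execute(past_device, action):
        return True
    # 2) as above, but current also carries position info, and a historical
    #    device with similar position can run the action
    if (action and not device and current.get("position")
            and similar_position(to_match.get("position"), current["position"])
            and can_execute(past_device, action)):
        return True
    # 3) current has a device but no action; it can run the historical action
    if device and not action and can_execute(device, to_match.get("action")):
        return True
    return False
```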
6. The method of claim 3, wherein said forming complete target structured data comprises:
and combining the context to be matched and the current context to form complete target structured data.
7. The method of claim 1, wherein said performing an action based on said target structured data comprises:
controlling a device to execute an action according to the target structured data;
receiving an action execution result returned by the device;
and matching a corresponding preset reply template according to the target structured data and the action execution result to generate a reply text.
8. The method of claim 7, wherein the performing an action based on the target structured data further comprises:
and converting the reply text into voice output.
9. The method according to claim 7 or 8, wherein matching a corresponding preset reply template according to the target structured data and the action execution result to generate a reply text comprises:
matching a corresponding preset reply template according to the target structured data and the action execution result;
and filling the valid device information and the valid action information from the target structured data into the preset reply template to generate the reply text.
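A sketch of this template lookup and slot filling; the template table, its keys, and the result values are all invented for illustration:

```python
# Match a preset reply template by (intent, result), then fill device/action
# slots from the target structured data. Templates here are hypothetical.
REPLY_TEMPLATES = {
    ("switch", "success"): "OK, the {device} has been switched {action}.",
    ("switch", "failure"): "Sorry, the {device} could not be switched {action}.",
}

def generate_reply(target: dict, result: str) -> str:
    template = REPLY_TEMPLATES.get(
        (target.get("intent"), result),
        "Done." if result == "success" else "Something went wrong.",
    )
    return template.format(device=target.get("device"),
                           action=target.get("action"))
```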
10. The method of claim 1, wherein said processing the text information to generate and record current structured data comprises:
performing intent classification on the text information, and judging whether the text information is an interactive control command;
and if the text information is the interactive control command, performing semantic analysis on the text information to generate structured data corresponding to the intention classification.
11. The method of claim 10, wherein performing intent classification on the text information and judging whether it is an interactive control command comprises:
judging whether the intention of the text information is clear or not;
if the intention of the text information is clear, judging whether the text information is an interactive control command;
and if the intention of the text information is not clear, prompting the user to express a clear intention.
12. The method of claim 11, wherein, if the intention of the text information is clear, judging whether the text information is an interactive control command comprises:
and if the text information is not an interactive control command, calling a chatbot interface to directly generate a reply.
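A sketch of the routing described in claims 10-12; classify_intent, prompt, semantic_parse, and chatbot_reply are hypothetical stand-ins:

```python
# Route text by intent: unclear -> prompt the user; control command ->
# semantic parsing; anything else -> hand off to a chatbot interface.
def handle_text(text: str) -> str:
    intent = classify_intent(text)            # e.g. a trained classifier
    if intent is None:                        # intention unclear
        return prompt("Please state your request more clearly.")
    if intent == "interactive_control":
        return semantic_parse(text, intent)   # claims 13-14
    return chatbot_reply(text)                # direct reply for chit-chat
```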
13. The method of claim 10, wherein performing semantic analysis on the text information to generate the structured data corresponding to the intent classification comprises:
performing word segmentation on the text information;
performing part-of-speech tagging and named entity recognition on the segmented text information;
and performing template-matching-based mapping analysis on the results of the part-of-speech tagging and the named entity recognition to generate the structured data under the intention.
14. The method of claim 13, wherein performing template-matching-based mapping analysis on the results of the part-of-speech tagging and the named entity recognition to generate the structured data under the intention comprises:
combining the results of the part-of-speech tagging and the named entity recognition with a word list template to generate the structured data corresponding to the intent classification, wherein the word list template describes the correspondence between devices and operation attributes under the intention.
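One plausible rendering of claims 13-14 in Python, using the open-source jieba tokenizer for segmentation and part-of-speech tagging; the word-list template and the deliberately crude device/verb extraction are invented for illustration:

```python
# Segment and POS-tag the text, then map the results through a word-list
# template that pairs each device with its operable attributes.
import jieba.posseg as pseg

WORD_LIST_TEMPLATE = {                  # hypothetical device -> attributes map
    "light": {"switch", "brightness"},
    "air conditioner": {"switch", "temperature"},
}

def semantic_parse(text: str, intent: str) -> dict:
    tagged = [(p.word, p.flag) for p in pseg.cut(text)]   # segmentation + POS
    device = next((w for w, _ in tagged if w in WORD_LIST_TEMPLATE), None)
    action = next((w for w, f in tagged if f.startswith("v")), None)  # a verb
    return {"intent": intent, "device": device, "action": action}
```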
15. A speech information processing apparatus characterized by comprising:
the conversion module is used for converting the acquired voice information into text information;
the preprocessing module is used for processing the text information, and generating and recording current structured data;
the judging module is used for judging whether the current structured data is complete or not;
the processing module is used for performing state model matching on the current structured data and historical structured data to form complete target structured data if the current structured data is incomplete, and for taking the current structured data as the target structured data if it is complete;
and the response module is used for executing actions according to the target structured data.
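The five modules of claim 15, sketched as one plain Python class; the constructor arguments are hypothetical callables wired together as in claim 1:

```python
# The apparatus of claim 15 as a composition of five modules (hypothetical).
class SpeechProcessingApparatus:
    def __init__(self, convert, preprocess, judge, process, respond):
        self.convert = convert        # conversion module: voice -> text
        self.preprocess = preprocess  # preprocessing module: text -> structured data
        self.judge = judge            # judging module: completeness check
        self.process = process        # processing module: state model matching
        self.respond = respond        # response module: execute the action

    def run(self, voice_info, history):
        current = self.preprocess(self.convert(voice_info))
        history.append(current)
        target = current if self.judge(current) else self.process(current, history)
        return self.respond(target)
```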
16. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the method of any of claims 1-14.
17. A computer-readable storage medium having program code stored therein, the program code being invoked by a processor to perform the method of any one of claims 1 to 14.
CN201811390958.0A 2018-11-21 2018-11-21 Voice information processing method and device, electronic equipment and storage medium Active CN111210824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811390958.0A CN111210824B (en) 2018-11-21 2018-11-21 Voice information processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111210824A true CN111210824A (en) 2020-05-29
CN111210824B CN111210824B (en) 2023-04-07

Family

ID=70787699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811390958.0A Active CN111210824B (en) 2018-11-21 2018-11-21 Voice information processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111210824B (en)

Patent Citations (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5892813A (en) * 1996-09-30 1999-04-06 Matsushita Electric Industrial Co., Ltd. Multimodal voice dialing digital key telephone with dialog manager
US20040186730A1 (en) * 2002-12-11 2004-09-23 Jiping Sun Knowledge-based flexible natural speech dialogue system
US20090150441A1 (en) * 2005-12-08 2009-06-11 Tandberg Telecom As Context aware phonebook
US20150179170A1 (en) * 2013-12-20 2015-06-25 Microsoft Corporation Discriminative Policy Training for Dialog Systems
CN105224278A (en) * 2015-08-21 2016-01-06 百度在线网络技术(北京)有限公司 Interactive voice service processing method and device
CN105206266A (en) * 2015-09-01 2015-12-30 重庆长安汽车股份有限公司 Vehicle-mounted voice control system and method based on user intention guess
US20170228366A1 (en) * 2016-02-05 2017-08-10 Adobe Systems Incorporated Rule-based dialog state tracking
CN105701253A (en) * 2016-03-04 2016-06-22 南京大学 Chinese natural language interrogative sentence semantization knowledge base automatic question-answering method
CN107491284A (en) * 2016-06-10 2017-12-19 苹果公司 The digital assistants of automation state report are provided
CN107799116A (en) * 2016-08-31 2018-03-13 科大讯飞股份有限公司 More wheel interacting parallel semantic understanding method and apparatus
CN106503156A (en) * 2016-10-24 2017-03-15 北京百度网讯科技有限公司 Man-machine interaction method and device based on artificial intelligence
CN106777018A (en) * 2016-12-08 2017-05-31 竹间智能科技(上海)有限公司 To the optimization method and device of read statement in a kind of intelligent chat robots
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method
US20180329878A1 (en) * 2017-05-10 2018-11-15 International Business Machines Corporation Conversational authoring of event processing applications
US20180329998A1 (en) * 2017-05-15 2018-11-15 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
CN107369443A (en) * 2017-06-29 2017-11-21 北京百度网讯科技有限公司 Dialogue management method and device based on artificial intelligence
CN107315737A (en) * 2017-07-04 2017-11-03 北京奇艺世纪科技有限公司 A kind of semantic logic processing method and system
CN107515944A (en) * 2017-08-31 2017-12-26 广东美的制冷设备有限公司 Exchange method, user terminal and storage medium based on artificial intelligence
CN107370649A (en) * 2017-08-31 2017-11-21 广东美的制冷设备有限公司 Household electric appliance control method, system, control terminal and storage medium
CN107589967A (en) * 2017-09-26 2018-01-16 南京哈卢信息科技有限公司 A kind of big data statistical analysis technique of text level
CN107590270A (en) * 2017-09-26 2018-01-16 南京哈卢信息科技有限公司 A kind of method that rapid data is analyzed and gives birth to text formatting
CN107632979A (en) * 2017-10-13 2018-01-26 华中科技大学 The problem of one kind is used for interactive question and answer analytic method and system
CN108182229A (en) * 2017-12-27 2018-06-19 上海科大讯飞信息科技有限公司 Information interacting method and device
CN108228764A (en) * 2017-12-27 2018-06-29 神思电子技术股份有限公司 A kind of single-wheel dialogue and the fusion method of more wheel dialogues
CN108427722A (en) * 2018-02-09 2018-08-21 卫盈联信息技术(深圳)有限公司 intelligent interactive method, electronic device and storage medium
CN108694942A (en) * 2018-04-02 2018-10-23 浙江大学 A kind of smart home interaction question answering system based on home furnishings intelligent service robot
CN108664472A (en) * 2018-05-08 2018-10-16 腾讯科技(深圳)有限公司 Natural language processing method, apparatus and its equipment
CN108646580A (en) * 2018-05-14 2018-10-12 中兴通讯股份有限公司 The determination method and device of control object, storage medium, electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Yu et al., "A Survey of Research on Dialogue Management Methods in Spoken Dialogue Systems," Computer Science *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111063348A (en) * 2019-12-13 2020-04-24 腾讯科技(深圳)有限公司 Information processing method, device and equipment and computer storage medium
CN112272259A (en) * 2020-10-23 2021-01-26 北京蓦然认知科技有限公司 Training method and device for automatic assistant
CN112272259B (en) * 2020-10-23 2021-06-01 北京蓦然认知科技有限公司 Training method and device for automatic assistant
WO2022198365A1 (en) * 2021-03-22 2022-09-29 华为技术有限公司 Voice control method and apparatus

Also Published As

Publication number Publication date
CN111210824B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN110807332A (en) Training method of semantic understanding model, semantic processing method, semantic processing device and storage medium
CN112201246B (en) Intelligent control method and device based on voice, electronic equipment and storage medium
CN110209812B (en) Text classification method and device
CN111210824B (en) Voice information processing method and device, electronic equipment and storage medium
WO2020107834A1 (en) Verification content generation method for lip-language recognition, and related apparatus
US11830482B2 (en) Method and apparatus for speech interaction, and computer storage medium
CN111737987B (en) Intention recognition method, device, equipment and storage medium
CN111161726B (en) Intelligent voice interaction method, device, medium and system
US11393490B2 (en) Method, apparatus, device and computer-readable storage medium for voice interaction
CN114038457B (en) Method, electronic device, storage medium, and program for voice wakeup
CN111312233A (en) Voice data identification method, device and system
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN111292731A (en) Voice information processing method and device, electronic equipment and storage medium
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN111862943B (en) Speech recognition method and device, electronic equipment and storage medium
CN116797695A (en) Interaction method, system and storage medium of digital person and virtual whiteboard
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN113360630B (en) Interactive information prompting method
CN113962213A (en) Multi-turn dialog generation method, terminal and computer readable storage medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
CN111625636A (en) Man-machine conversation refusal identification method, device, equipment and medium
CN112017662A (en) Control instruction determination method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant