CN111428512B - Semantic recognition method, device and equipment - Google Patents

Semantic recognition method, device and equipment

Info

Publication number
CN111428512B
Authority
CN
China
Prior art keywords
semantic
scene
target
recognition model
client
Prior art date
Legal status
Active
Application number
CN202010232031.5A
Other languages
Chinese (zh)
Other versions
CN111428512A (en)
Inventor
王夏鸣
Current Assignee
Volkswagen Mobvoi Beijing Information Technology Co Ltd
Original Assignee
Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Volkswagen Mobvoi Beijing Information Technology Co Ltd filed Critical Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority to CN202010232031.5A
Publication of CN111428512A
Application granted
Publication of CN111428512B

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention discloses a semantic recognition method, a semantic recognition device and semantic recognition equipment. The semantic recognition method comprises the following steps: the client acquires a target identity of a user and target scene information matched with a target voice instruction according to the target voice instruction input by the user; the client determines a target scene semantic recognition model matched with the target scene information in at least one scene semantic recognition model matched with the target identity; and the client identifies the target voice instruction according to the target scene semantic identification model. According to the technical scheme, the semantic recognition is carried out on the client by using the scene semantic recognition model matched with the application scene, so that the efficiency of the semantic recognition is improved.

Description

Semantic recognition method, device and equipment
Technical Field
The embodiment of the invention relates to a computer technology, in particular to a semantic recognition method, a semantic recognition device and semantic recognition equipment.
Background
At present, vehicle-mounted voice recognition systems are widely used while driving: a driver can control the vehicle by voice command alone, without manual operation, which brings convenience to the driver. At the same time, some defects have gradually emerged. For example, the latency from the moment a user inputs a voice command to the moment the vehicle-mounted system gives feedback is usually more than 2 seconds, and for more complex voice commands the wait can reach 5-6 seconds, so the driver must devote extra attention to waiting for the system's feedback, which slows the driver's response to emergencies.
In the prior art, because the computing power and storage space of the local system are limited, it cannot host a complete recognition model; it can only use a smaller recognition model to handle high-frequency tasks that can be processed locally, such as making a phone call. To improve the response speed of the system, a cloud-plus-terminal voice recognition system is therefore generally adopted: the cloud system is responsible for most of the voice recognition and semantic understanding, while the local terminal system is responsible for a small number of functions such as telephone and car control. However, the recognition process of the cloud system involves a relatively time-consuming network transmission step, so the cloud system's feedback is usually slower than the local system's, while the local terminal system, constrained by computing power and storage space, cannot complete more voice tasks.
Disclosure of Invention
The embodiment of the invention provides a semantic recognition method, a semantic recognition device and semantic recognition equipment, which are used for carrying out semantic recognition on a client by using a scene semantic recognition model matched with an application scene, so that the efficiency of semantic recognition is improved.
In a first aspect, an embodiment of the present invention provides a semantic recognition method, where the method includes:
the client acquires a target identity of a user and target scene information matched with a target voice instruction according to the target voice instruction input by the user;
The client determines a target scene semantic recognition model matched with the target scene information in at least one scene semantic recognition model matched with the target identity;
and the client identifies the target voice instruction according to the target scene semantic identification model.
In a second aspect, an embodiment of the present invention provides a semantic recognition method, where the method includes:
the method comprises the steps that a server acquires training data sent by a target client in real time, and groups the training data according to identity marks included in the training data;
when the server determines that a target group corresponding to the target identity meets a model generation condition, calculating semantic features under at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target group;
and the server determines at least one scene semantic identification model according to the semantic features in each scene dimension, and transmits the scene semantic identification model corresponding to the target identity to the target client.
In a third aspect, an embodiment of the present invention further provides a semantic recognition apparatus, where the apparatus includes:
The scene information acquisition module is used for acquiring a target identity of a user and target scene information matched with the target voice instruction according to the target voice instruction input by the user by the client;
the identification model determining module is used for determining a target scene semantic identification model matched with the target scene information in at least one scene semantic identification model matched with the target identity by the client;
and the voice information recognition module is used for the client to recognize the target voice instruction according to the target scene semantic recognition model.
In a fourth aspect, an embodiment of the present invention further provides a semantic recognition apparatus, where the apparatus includes:
the training data acquisition module is used for acquiring training data sent by the target client in real time by the server and grouping the training data according to the identity mark included in the training data;
the semantic feature calculation module is used for calculating semantic features under at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target group when the server determines that the target group corresponding to the target identity meets the model generation condition;
The identification model determining module is used for determining at least one scene semantic identification model according to semantic features in each scene dimension by the server and transmitting the scene semantic identification model corresponding to the target identity to the target client.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the semantic recognition method provided by any embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the semantic recognition method provided by any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, the client acquires the target identity of the user and the target scene information matched with the target voice instruction according to the target voice instruction input by the user, determines the target scene semantic recognition model matched with the target scene information among at least one scene semantic recognition model matched with the target identity, and finally recognizes the target voice instruction according to the target scene semantic recognition model. This solves the prior-art problem that, because computing power and storage space are limited, a combined client-and-server semantic recognition mode must be used, whose long data-transmission time makes semantic recognition slow; semantic recognition efficiency is thereby improved.
Drawings
FIG. 1 is a flow chart of a semantic recognition method according to a first embodiment of the present invention;
FIG. 2 is a flow chart of a semantic recognition method according to a second embodiment of the present invention;
FIG. 3 is a flow chart of a semantic recognition method according to a third embodiment of the present invention;
FIG. 4 is a flow chart of a semantic recognition method according to a fourth embodiment of the present invention;
FIG. 5 is a schematic diagram of a semantic recognition device according to a fifth embodiment of the present invention;
FIG. 6 is a schematic diagram of a semantic recognition device according to a sixth embodiment of the present invention;
fig. 7 is a schematic structural diagram of an apparatus according to a seventh embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a semantic recognition method according to a first embodiment of the present invention, where the technical solution of the present embodiment is applicable to a case where a client performs semantic recognition through a scene semantic recognition model, and the method may be performed by a semantic recognition device, where the device may be implemented by software and/or hardware, and may be integrated in various general-purpose computer devices, and specifically includes the following steps:
Step 110, the client obtains the target identity of the user and the target scene information matched with the target voice command according to the target voice command input by the user.
The target voice command is a voice command which is input by a user and needs to be subjected to semantic recognition, and the target voice command can be 'play a piece of soothing music'; the target identity is an identity capable of characterizing a unique user, e.g., the target identity may be a user ID corresponding to current user identity information; the target scene information is a scene in which the user inputs a target voice instruction, and may include a current time and a current position, for example.
In this embodiment, the client obtains a target voice command input by the user, determines an identity of the user and target scene information matched with the target voice command according to the target voice command, specifically, the client obtains a sound feature corresponding to the target voice command, determines a target identity according to a corresponding relationship between the sound feature and the identity of the user, and simultaneously obtains time information and position information when the user inputs the voice command.
For example, the client first obtains the target voice command and extracts its voiceprint information, then looks up the target identity corresponding to the current voiceprint in the prestored mapping between voiceprint information and user identities. The client obtains the time at which the target voice command was received from its clock, and determines the position at which it was received through the client's GPS (Global Positioning System).
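The step-110 flow described above — voiceprint lookup followed by clock and GPS capture — can be sketched as follows. This is an illustrative sketch only: the voiceprint extraction itself is assumed to happen upstream, and all names and signatures here are hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SceneInfo:
    time: datetime   # clock reading when the command was received
    lat: float       # GPS latitude
    lon: float       # GPS longitude

def acquire_identity_and_scene(voiceprint, voiceprint_to_id, now, gps):
    """Step 110 sketch: resolve the speaker's identity from the prestored
    voiceprint-to-ID mapping and record the time/position of the command."""
    user_id = voiceprint_to_id.get(voiceprint)  # None if the speaker is unregistered
    return user_id, SceneInfo(time=now, lat=gps[0], lon=gps[1])
```

The identity and scene returned here are exactly the two inputs that step 120 needs to choose a scene semantic recognition model.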
Step 120, the client determines a target scene semantic recognition model matched with the target scene information in at least one scene semantic recognition model matched with the target identity.
The scene semantic recognition models are semantic recognition models matched with specific scenes, each scene semantic recognition model corresponds to at least one application scene, and the application scene can comprise time information and position information.
In this embodiment, after acquiring a target identity of a user and target scene information matched with a target voice instruction, a client first queries at least one locally stored scene semantic recognition model corresponding to a current target identity, and then searches for a target semantic recognition model matched with the current target scene information in the at least one scene semantic recognition model.
For example, the client prestores a scene semantic recognition model A associated with the period 5:30-6:00 p.m. and with the GPS coordinate area of the user's company, and this model corresponds to the identity of user A. When user A sends a voice instruction to the client, and the client detects that the current time is within 5:30-6:00 p.m. and the current position is within the company's GPS coordinate area, it determines scene semantic recognition model A as the target scene semantic recognition model.
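Model selection as in the example above can be sketched as a lookup over (time window, GPS bounding box) pairs. The spec format and all names below are assumptions for illustration; the patent does not fix a concrete data layout.

```python
def select_scene_model(models, hour, lat, lon):
    """Step 120 sketch: among the models stored for the identified user,
    return the first one whose time window and GPS bounding box both
    contain the current scene; None means no local model applies."""
    for name, spec in models.items():
        (start_h, end_h), (lat_min, lat_max, lon_min, lon_max) = spec
        if start_h <= hour < end_h and lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
            return name
    return None

# Hypothetical example: model A covers 17:00-18:00 inside the company's GPS area.
user_a_models = {
    "model_A": ((17, 18), (39.90, 39.95, 116.40, 116.45)),
}
```

When `select_scene_model` returns None, a fallback (e.g. the cloud system described in the Background section) would handle the command.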
And 130, the client identifies the target voice instruction according to the target scene semantic identification model.
In this embodiment, after determining a target scene semantic recognition model that matches the target scene information, the client uses that model to recognize the currently received voice command. For example, the target voice command input by user A is "query nearby restaurants"; the client determines, among at least one scene semantic recognition model corresponding to the identity of user A, the scene semantic recognition model A matching the scene information at the time user A input the command, and finally uses scene semantic recognition model A to perform semantic recognition on the target voice command.
Optionally, when the client detects that the target user logs in the client for the first time, guiding the target user to perform voiceprint registration, and storing voiceprint information of the target user and the identity of the target user correspondingly;
the voiceprint information is used for identifying the identity of the user inputting the voice command.
In this optional embodiment, specific operations that the client needs to perform when detecting that the target user logs in the client for the first time are provided, including guiding the target user to perform voiceprint registration, and storing voiceprint information of the target user and an identity of the target user locally, so that in a subsequent use, the client can determine the identity of the user according to a target voice command input by the target user.
For example, when the client detects that the current user is the first-time login client, the user is guided to input the set voice information, voiceprint recognition is performed on the voice information input by the user, voiceprint information of the current user is extracted, and the voiceprint information is bound with an ID (identity) of the user when the user logs in.
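The first-login registration flow above amounts to maintaining a voiceprint-to-ID binding that later commands are resolved against. A minimal sketch, with hypothetical names (the patent does not prescribe an API):

```python
class VoiceprintRegistry:
    """First-login sketch: bind a user's extracted voiceprint to their login
    ID so later voice commands can be attributed without an explicit login."""

    def __init__(self):
        self._id_by_print = {}

    def register(self, voiceprint, user_id):
        # Called once, after guiding the first-time user through voiceprint entry.
        self._id_by_print[voiceprint] = user_id

    def identify(self, voiceprint):
        # Returns None for an unregistered speaker.
        return self._id_by_print.get(voiceprint)
```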
According to the technical scheme of the embodiment of the invention, the client acquires the target identity of the user and the target scene information matched with the target voice instruction according to the target voice instruction input by the user, determines the target scene semantic recognition model matched with the target scene information among at least one scene semantic recognition model matched with the target identity, and finally recognizes the target voice instruction according to the target scene semantic recognition model. This solves the prior-art problems that the client, limited in computing power and storage space, cannot cover more voice tasks, and that the combined client-and-server semantic recognition mode is slow because of the long data-transmission time; semantic recognition efficiency and driving safety are thereby improved.
Example two
Fig. 2 is a flowchart of a semantic recognition method according to a second embodiment of the present invention, where the embodiment is further refined based on the foregoing embodiment, and provides specific steps before a client obtains a target identity of a user and target scene information matched with the target voice instruction according to the target voice instruction input by the user, and specific steps after the client performs recognition processing on the target voice instruction according to a target scene semantic recognition model. The following describes a semantic recognition method according to a second embodiment of the present invention with reference to fig. 2, including the following steps:
step 210, the client obtains the target identity of the user and the target scene information matched with the target voice command according to the target voice command input by the user.
Optionally, before the client obtains the target identity of the user and the target scene information matched with the target voice command according to the target voice command input by the user, the method further includes:
the client collects at least one historical voice instruction input by a user, the identity of the user and historical scene information matched with the historical voice instruction in real time as training data;
Uploading the training data to a server in real time, so that the server generates at least one scene semantic recognition model corresponding to each identity according to the training data;
and receiving at least one scene semantic recognition model which is issued by the server and corresponds to each identity mark.
In this optional embodiment, specific operations are provided before the client obtains the target identity of the user and the target scene information matched with the target voice instruction according to the target voice instruction input by the user, firstly, the client collects the historical voice instruction input by the user, the identity of the user and the historical scene information matched with the historical voice instruction in real time as training data, and uploads the training data to the server, so that the server determines at least one scene semantic recognition model corresponding to the identity of the current user through the training data, and finally the client receives the at least one scene semantic recognition model corresponding to the identity issued by the server, and can select the scene semantic recognition model matched with the current scene information to perform semantic recognition on the voice instruction after receiving the voice instruction input by the user subsequently.
Step 220, the client determines a target scene semantic recognition model matched with the target scene information in at least one scene semantic recognition model matched with the target identity.
And 230, the client identifies the target voice instruction according to the target scene semantic identification model.
Step 240, the client uploads the target voice command, the target identity and the target scene information of the user to the server, so that the server updates at least one scene semantic recognition model matched with the target identity.
In this embodiment, to ensure that the scene semantic recognition models stored by the client remain closely matched to the user's actual usage, the models need to be updated according to how the user uses them. Specifically, the client uploads the target voice command, target identity and target scene information input by the user to the server in real time as training data; when the amount of training data reaches a set threshold, the server updates at least one scene semantic recognition model corresponding to the currently uploaded target identity, so that the models match the user's recent usage habits.
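The threshold-triggered update described above can be sketched as a buffer that fires a retraining callback once enough records accumulate. The threshold value and all names are illustrative (the later embodiments use 100 records as an example).

```python
RETRAIN_THRESHOLD = 100  # example value, matching the 100-record example in Embodiment III

def record_and_maybe_retrain(buffer, record, retrain):
    """Server-side sketch: accumulate (command, identity, scene) records and
    retrain the user's scene models once the buffer reaches the threshold.
    Returns the new models, or None if the threshold is not yet reached."""
    buffer.append(record)
    if len(buffer) >= RETRAIN_THRESHOLD:
        models = retrain(list(buffer))
        buffer.clear()  # start accumulating toward the next update
        return models
    return None
```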
According to the technical scheme of this embodiment, the client uploads the historical voice commands, the user identity and the historical scene information matched with those commands to the server as training data, so that the server generates at least one scene semantic recognition model corresponding to each identity. The client then performs semantic recognition on the user's target voice command with the scene semantic model issued by the server, and after recognition uploads the target voice command, target identity and target scene information to the server as new training data so that the server can update the models. Because each scene semantic recognition model is updated according to the user's actual usage, it better matches the user's real usage scenes, and the model's recognition success rate is improved.
Example III
Fig. 3 is a flowchart of a semantic recognition method in a third embodiment of the present invention, where the technical solution of the present embodiment is suitable for a case where a server determines a scene semantic recognition model according to training data sent by a client, where the method may be performed by a semantic recognition device, and the device may be implemented by software and/or hardware and may be integrated in various general purpose computer devices, and specifically includes the following steps:
Step 310, the server acquires the training data sent by the target client in real time, and groups each training data according to the identity mark included in the training data.
The training data is data sent by the target client and used for training a scene semantic recognition model, and the training data comprises a historical voice instruction, an identity mark of a user and historical scene information matched with the historical voice instruction.
In this embodiment, the server acquires the training data sent by the target client in real time, and groups the training data according to the identity identifier included in the training data, so as to establish a set of corresponding scene semantic recognition models for each user. The server, after acquiring the training data sent by the target client, first determines the user ID corresponding to the training data, and then groups the training data according to the user ID.
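The grouping in step 310 — bucketing incoming training records by user ID so each user gets an independent set of scene semantic models — can be sketched as follows. The record layout is an assumption for illustration.

```python
from collections import defaultdict

def group_by_identity(records):
    """Step 310 sketch: bucket training records by the user ID they carry.
    Each record is assumed to be a (user_id, command, scene_info) tuple."""
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return dict(groups)
```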
Step 320, when the server determines that the target packet corresponding to the target identity meets the model generation condition, calculating semantic features in at least one scene dimension according to the historical scene information and the historical voice command included in each training data in the target packet.
The semantic features are features that characterize a user's voice-command usage habits in a specific dimension or a specific usage scene. For example, a semantic feature may describe the user's voice-command usage habits in each set time period of the time dimension, where a usage habit refers to the semantic categories of the voice commands the user inputs. If, between 5:30 and 6:00 p.m., the semantic categories the user uses most are navigation and playing music, then navigation and playing music together form the semantic feature of the time dimension for that period.
In this embodiment, when determining that a target packet corresponding to a target identity meets a model generation condition, the server calculates semantic features in at least one scene dimension according to historical scene information and historical voice instructions contained in training data in the target packet, specifically, identifies semantic categories corresponding to the historical voice instructions, and then determines the semantic features in at least one scene dimension according to the semantic categories.
For example, after detecting that the training data in a target group has reached a set threshold, e.g. 100 records, the server determines that the target group corresponding to the target identity meets the model generation condition. It first identifies the semantic category of each historical voice command (searching for a restaurant is one category, playing music is another), counts how many times each category is used in each time period of the day, sorts the categories from most to least used, and takes the top 5 as the semantic feature for the time period currently being counted. Alternatively, it divides all the positions where the user has used voice commands into areas, counts the usage of each semantic category in each area, sorts the categories from most to least used, and takes the top 5 as the semantic feature for the area currently being counted.
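The per-period counting and top-5 selection described above reduce to a frequency count per bucket. A minimal sketch (the period keys and record layout are assumptions; the same function works for areas in the spatial dimension by using an area key instead of a period key):

```python
from collections import Counter, defaultdict

TOP_N = 5  # keep the 5 most-used categories per bucket, as in the example

def semantic_features(records):
    """Step 320 sketch: count semantic-category usage per bucket (a time
    period or an area) and keep the TOP_N most frequent categories as that
    bucket's semantic feature. Records are (bucket_key, category) pairs."""
    counts = defaultdict(Counter)
    for bucket, category in records:
        counts[bucket][category] += 1
    return {b: [cat for cat, _ in c.most_common(TOP_N)] for b, c in counts.items()}
```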
Step 330, the server determines at least one scene semantic recognition model according to the semantic features in each scene dimension, and issues the scene semantic recognition model corresponding to the target identity to the target client.
In this embodiment, a scene semantic recognition model corresponding to each application scene is determined according to the semantic features in each scene dimension, and the scene semantic recognition models corresponding to the target identity are issued to the target client. For example, each day is divided into 24 time periods and the semantic feature of each period is obtained; all positions where the user has used voice commands are divided into several areas and the semantic feature of each area is obtained. The semantic features of the time periods and those of the areas are then compared pairwise; if the semantic feature of 5:00-6:00 p.m. is found to be identical to that of area 1, the semantic recognition models that recognize these two features are packaged together as the scene semantic recognition model corresponding to both 5:00-6:00 p.m. and area 1.
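The pairwise comparison and packaging step above can be sketched as follows: whenever a time period's category set coincides with a region's, one packaged model serves both scenes. All names are hypothetical.

```python
def package_models(time_features, region_features):
    """Step 330 sketch: compare each time period's semantic feature with each
    region's; when the category sets coincide, emit one packaged scene model
    covering both the time window and the region."""
    packaged = []
    for period, time_cats in time_features.items():
        for region, region_cats in region_features.items():
            if set(time_cats) == set(region_cats):
                packaged.append((period, region, sorted(set(time_cats))))
    return packaged
```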
According to the technical scheme of this embodiment, the server acquires the training data sent by the target client in real time and groups the training data according to the identity each record carries. When a target group corresponding to a target identity meets the model generation condition, the server calculates semantic features in at least one scene dimension from the historical scene information and historical voice commands contained in the group's training data; it then determines at least one scene semantic recognition model from the semantic features in each scene dimension and issues the scene semantic recognition models corresponding to the target identity to the target client.
Example IV
Fig. 4 is a flowchart of a semantic recognition method in a fourth embodiment of the present invention, where the embodiment is further refined on the basis of the foregoing embodiment, and provides a specific step in which, when the server determines that a target packet corresponding to a target identity meets a model generation condition, the server calculates semantic features in at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target packet, and a specific step in which the server determines at least one scene semantic recognition model according to the semantic features in each scene dimension, and issues the scene semantic recognition model corresponding to the target identity to a target client. The following describes a semantic recognition method according to the fourth embodiment of the present invention with reference to fig. 4, which includes the following steps:
step 410, the server acquires the training data sent by the target client in real time, and groups each training data according to the identity mark included in the training data.
Step 420, when the server determines that the training data in the target packet reaches the set threshold, dividing the time information in the historical scene information into a set number of time periods, and counting the semantic categories corresponding to the historical voice instructions in each time period and the occurrence times of each semantic category, so as to obtain a first semantic feature set in the time dimension.
The first semantic feature set is composed of the semantic features corresponding to each time period, and each semantic feature is composed of several semantic categories appearing within the set time period. For example, the first semantic feature set may consist of 24 semantic features corresponding to 24 time periods; the semantic feature for 2:00-3:00 p.m. is composed of the 5 semantic categories appearing between 2:00 and 3:00 p.m., which may be the top 5 categories when ranked by number of occurrences.
In this embodiment, a manner of obtaining the first semantic feature set in the time dimension is provided: when the training data in the target packet received by the server reaches a set threshold, the time information in the historical scene information is divided into a plurality of time periods, the semantic categories corresponding to the historical voice instructions occurring in each time period are determined, the occurrences of each semantic category in each time period are counted, and the first semantic feature set in the time dimension is determined from the counted result.
Illustratively, when the training data in the target packet received by the server reaches 100 pieces, the time information in the historical scene information is divided into 24 time periods, for example according to the 24 hours of a day, i.e., 00:00-1:00, 1:00-2:00, …, 23:00-00:00. The semantic categories corresponding to the historical voice instructions appearing in each time period are then determined, for example searching for a hotel as one category and searching for a route as another. After the semantic categories appearing in each time period are determined, the number of occurrences of each semantic category in the current time period is counted, the semantic categories are ranked from most to fewest occurrences, and the top 5 semantic categories are taken as the semantic feature corresponding to the current time period. Finally, the 24 semantic features corresponding to the 24 time periods together form the first semantic feature set in the time dimension.
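The hour-bucketing and top-5 counting described above can be sketched as follows. The record fields `time` and `category` are illustrative assumptions (the patent does not name the data layout), and the semantic category of each historical voice instruction is assumed to have been extracted already:

```python
from collections import Counter

TOP_K = 5  # keep the five most frequent categories per period, as in the text

def time_dimension_features(records, num_periods=24):
    """Bucket records by the hour of their scene time; each semantic
    feature is the top-K semantic categories of its one-hour period."""
    buckets = [Counter() for _ in range(num_periods)]
    for r in records:
        hour = int(r["time"].split(":")[0]) % num_periods
        buckets[hour][r["category"]] += 1
    return [[cat for cat, _ in b.most_common(TOP_K)] for b in buckets]

features = time_dimension_features([
    {"time": "14:10", "category": "query hotel"},
    {"time": "14:40", "category": "query hotel"},
    {"time": "14:55", "category": "search route"},
    {"time": "09:05", "category": "play music"},
])
# features[14] lists the categories seen between 14:00 and 15:00
```

The returned list of 24 features is the first semantic feature set of this step.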
Step 430: the server divides the position information in the historical scene information into a set number of regions and counts, for each region, the semantic categories corresponding to the historical voice instructions and the number of occurrences of each semantic category, thereby obtaining a second semantic feature set in the space dimension.
The second semantic feature set is composed of the semantic features corresponding to the regions, and each semantic feature is composed of a number of semantic categories appearing in the corresponding region. For example, the second semantic feature set may consist of 20 semantic features corresponding to 20 regions into which the position information is divided, and the semantic feature corresponding to region 1 may consist of the 5 semantic categories appearing in region 1, where the 5 semantic categories are the top 5 categories ranked by number of occurrences.
In this embodiment, a manner of obtaining the second semantic feature set in the space dimension is provided: when the training data in the target packet received by the server reaches a set threshold, the position information in the historical scene information is divided into a plurality of regions, the semantic categories corresponding to the historical voice instructions occurring in each region are determined, the occurrences of each semantic category in each region are counted, and the second semantic feature set in the space dimension is determined from the counted result.
Illustratively, when the training data in the target packet received by the server reaches 100 pieces, the position information in the historical scene information is divided into m regions; specifically, the minimum rectangle containing all the position information in the historical scene information may be divided into regions of 1 km by 1 km, each region being numbered. The semantic category corresponding to each historical voice instruction appearing in each region is then determined, for example searching for a hotel as one category and searching for a route as another. After the semantic categories appearing in each region are determined, the number of occurrences of each semantic category in the current region is counted, the semantic categories are ranked by number of occurrences, and the top 5 semantic categories are taken as the semantic feature corresponding to the current region. Finally, the m semantic features corresponding to the m regions together form the second semantic feature set in the space dimension.
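One way to realize the 1 km gridding is to snap each coordinate to a grid cell. This is a sketch under simplifying assumptions: the `position` field and the rough degrees-to-kilometres conversion `KM_PER_DEG` are illustrative, not from the patent:

```python
import math
from collections import Counter, defaultdict

KM_PER_DEG = 111.0  # rough km per degree of latitude; illustrative assumption

def space_dimension_features(records, cell_km=1.0, top_k=5):
    """Snap each (lat, lon) to an approximately 1 km grid cell and keep
    the top-K semantic categories per cell."""
    cells = defaultdict(Counter)
    for r in records:
        lat, lon = r["position"]
        cell = (math.floor(lat * KM_PER_DEG / cell_km),
                math.floor(lon * KM_PER_DEG / cell_km))
        cells[cell][r["category"]] += 1
    return {cell: [c for c, _ in counter.most_common(top_k)]
            for cell, counter in cells.items()}

features = space_dimension_features([
    {"position": (39.9050, 116.3910), "category": "query hotel"},
    {"position": (39.9051, 116.3912), "category": "query hotel"},
    {"position": (39.9052, 116.3915), "category": "search route"},
])
# all three positions fall in the same ~1 km cell
```

The cell keys play the role of the region numbers in the text, and the per-cell category lists form the second semantic feature set.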
Step 440: the server combines each time period with each region in pairs, obtains the semantic categories corresponding to the historical voice instructions within each combination of time period and region and the number of occurrences of each semantic category, and thereby obtains a third semantic feature set.
the first semantic feature set, the second semantic feature set and the third semantic feature set are all composed of at least one semantic feature, and the semantic features are composed of a set number of semantic categories.
Wherein the third semantic feature set is composed of semantic features corresponding to the combinations of time periods and regions, each semantic feature being composed of a number of semantic categories occurring within the corresponding time period and region. For example, the third semantic feature set may be obtained by combining 24 time periods and 20 regions in pairs and counting, for each combination, the occurrences of each semantic category to obtain the semantic feature corresponding to that combination; these semantic features together form the third semantic feature set.
In this embodiment, a manner of acquiring the third semantic feature set in the time and space dimensions is provided: when the training data in the target packet received by the server reaches a set threshold, the time information and position information in the historical scene information are divided into a plurality of time periods and a plurality of regions, each time period is combined with each region in pairs, the semantic categories corresponding to the historical voice instructions appearing in each combination are determined, the occurrences of each semantic category in each combination are counted, and the third semantic feature set in the time and space dimensions is determined from the counted result.
Illustratively, when the training data in the target packet received by the server reaches 100 pieces, the time information in the historical scene information is divided into 24 time periods and the position information into m regions, and each time period is combined with each region in pairs to obtain 24 × m scene combinations. The semantic categories corresponding to the historical voice instructions appearing in each scene combination are determined, the number of occurrences of each semantic category in the current scene combination is counted, the semantic categories are ranked by number of occurrences, and the top 5 semantic categories are taken as the semantic feature corresponding to the current scene combination. Finally, the semantic features corresponding to the scene combinations together form the third semantic feature set in the time and space dimensions.
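The per-combination counting can be sketched as below; the `region` field is assumed to be the precomputed region number from step 430, and the other field names are likewise illustrative assumptions:

```python
from collections import Counter, defaultdict

def combined_features(records, top_k=5):
    """Count semantic categories per (time period, region) combination
    and keep the top-K categories for each combination."""
    counts = defaultdict(Counter)
    for r in records:
        hour = int(r["time"].split(":")[0])  # index of the one-hour period
        key = (hour, r["region"])            # one of the 24 x m combinations
        counts[key][r["category"]] += 1
    return {key: [c for c, _ in counter.most_common(top_k)]
            for key, counter in counts.items()}

features = combined_features([
    {"time": "14:10", "region": 2, "category": "query hotel"},
    {"time": "14:40", "region": 2, "category": "query hotel"},
    {"time": "14:50", "region": 5, "category": "play music"},
])
# features[(14, 2)] and features[(14, 5)] are two combination features
```

Only combinations that actually occur in the training data appear in the result; empty combinations simply have no entry.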
Step 450: the server compares the first semantic features in the first semantic feature set and the second semantic features in the second semantic feature set in pairs and determines whether the semantic categories contained in each first semantic feature and second semantic feature are all the same; if yes, step 460 is executed, otherwise step 470 is executed.
Wherein the first semantic feature and the second semantic feature are semantic features contained in the first semantic feature set and the second semantic feature set, respectively.
In this embodiment, in order to determine the scene semantic recognition model finally to be issued, at least one first semantic feature in the first semantic feature set and at least one second semantic feature in the second semantic feature set are compared in pairs to determine whether the semantic categories they contain are all the same; if yes, step 460 is executed, otherwise step 470 is executed.
Illustratively, the first semantic feature set includes a first semantic feature corresponding to the 2:00-3:00 time period, which contains the five semantic categories of querying a hotel, searching a route, playing music, making a call and sending a message; the second semantic feature corresponding to region 2 in the second semantic feature set also contains these five semantic categories. Since the semantic categories contained in the two semantic features are all the same, step 460 is executed.
Step 460, issuing the semantic recognition model corresponding to the first semantic feature as a scene semantic recognition model to the target client.
In this embodiment, when the first semantic feature being compared contains the same semantic categories as the second semantic feature, the semantic recognition model corresponding to the first semantic feature is issued to the target client as a scene semantic recognition model, and the application scene corresponding to this scene semantic recognition model is the time period corresponding to the first semantic feature together with the region corresponding to the second semantic feature.
Illustratively, when the first semantic feature corresponding to the 2:00-3:00 time period and the second semantic feature corresponding to region 2 contain the same semantic categories, the semantic recognition model corresponding to the first semantic feature is determined as a scene semantic recognition model whose application scene is the 2:00-3:00 time period and region 2. That is, if the time at which the user inputs a voice command falls within 2:00-3:00 and the position falls within region 2, this scene semantic recognition model is used to perform semantic recognition on the voice command.
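The pairwise comparison of steps 450-460 can be sketched as below, with the first feature set represented as a list indexed by period and the second as a dict keyed by region (data shapes are assumptions for illustration):

```python
def matching_pairs(time_features, region_features):
    """Yield every (period, region) pair whose semantic features contain
    exactly the same semantic categories (compared as unordered sets)."""
    for period, f1 in enumerate(time_features):
        for region, f2 in region_features.items():
            if f1 and set(f1) == set(f2):
                yield period, region

time_features = [[], ["query hotel", "search route"], ["play music"]]
region_features = {1: ["search route", "query hotel"], 2: ["send message"]}
pairs = list(matching_pairs(time_features, region_features))
# pairs == [(1, 1)]: only period 1 and region 1 share identical categories
```

For each yielded pair, the model corresponding to the first semantic feature would be issued with (period, region) as its application scene; non-matching pairs fall through to the rules of step 470.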
Step 470: the server determines a scene semantic recognition model according to a set rule and issues it to the target client.
The application scene of the scene semantic recognition model is the time period corresponding to the first semantic feature and the region corresponding to the second semantic feature.
In this embodiment, when the semantic categories included in the first semantic feature and the second semantic feature to be compared are not identical, a scene semantic recognition model needs to be determined according to a set rule and is issued to the target client.
Illustratively, when the first semantic feature corresponding to the 2:00-3:00 time period and the second semantic feature corresponding to region 5 do not contain exactly the same semantic categories, priorities may be set and the semantic recognition model corresponding to the higher-priority semantic feature issued as the scene semantic recognition model. For example, if the first semantic feature in the time dimension has the higher priority, the semantic recognition model corresponding to the first semantic feature is issued to the target client, and the application scene of this scene semantic recognition model is the 2:00-3:00 time period and region 5.
Optionally, determining a scene semantic recognition model according to a set rule, and issuing the scene semantic recognition model to the target client, including:
determining whether a third semantic feature matched with a time period corresponding to the first semantic feature and a region corresponding to a second semantic feature exists in the third semantic feature set;
if yes, the semantic recognition model corresponding to the third semantic features is used as a scene semantic recognition model to be issued to a target client;
If not, determining whether the first semantic features and the second semantic features are empty sets;
when either the first semantic feature or the second semantic feature is an empty set, the semantic recognition model corresponding to the non-empty semantic feature is issued to the target client as the scene semantic recognition model;
when both the first semantic feature and the second semantic feature are empty sets, a default semantic recognition model is issued to the target client as the scene semantic recognition model;
and when neither the first semantic feature nor the second semantic feature is an empty set, the semantic recognition model corresponding to the higher-priority semantic feature is issued to the target client as the scene semantic recognition model.
In this optional embodiment, a specific manner of handling each comparison case when the semantic categories contained in the first and second semantic features being compared are not all the same is provided. First, determine whether the third semantic feature set contains a third semantic feature matching the time period corresponding to the first semantic feature and the region corresponding to the second semantic feature; if so, the semantic recognition model corresponding to that third semantic feature is directly issued to the target client as the scene semantic recognition model. Otherwise, further judge whether the first semantic feature and the second semantic feature are empty sets, i.e., whether they contain 0 semantic categories. If exactly one of them is an empty set, the semantic recognition model corresponding to the non-empty semantic feature is issued to the target client as the scene semantic recognition model. When both are empty sets, a default semantic recognition model is issued to the target client as the scene semantic recognition model; the default semantic recognition model may be a semantic recognition model stored in the target client in advance. When neither is an empty set, the semantic recognition model corresponding to the higher-priority semantic feature is issued to the target client as the scene semantic recognition model; for example, when the first semantic feature has the higher priority, the semantic recognition model corresponding to the first semantic feature is issued to the target client.
The application scene corresponding to each scene semantic recognition model is the time period corresponding to the first semantic feature and the region corresponding to the second semantic feature being compared.
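The fallback rules above can be sketched as a small decision function. This is a sketch under stated assumptions: the candidate models are passed in directly, and `prefer_time=True` encodes the example where the time dimension has the higher priority:

```python
def choose_model(third_model, first_feature, second_feature,
                 time_model, space_model, default_model, prefer_time=True):
    """Apply the fallback rules: a matching combined (time+region) model
    wins; otherwise a non-empty feature's model; otherwise the
    higher-priority dimension; the default only when both are empty."""
    if third_model is not None:          # a matching third semantic feature exists
        return third_model
    if not first_feature and not second_feature:
        return default_model             # both features are empty sets
    if not first_feature:
        return space_model               # only the space feature is non-empty
    if not second_feature:
        return time_model                # only the time feature is non-empty
    return time_model if prefer_time else space_model
```

In every branch, the chosen model would be issued with the compared time period and region as its application scene.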
Optionally, before the scene semantic recognition model corresponding to the target identity is issued to the target client, the method further includes:
The server compares the currently determined scene semantic recognition model with the scene semantic recognition models already issued to the target client; if it is the same as one of them, the current scene semantic recognition model is not issued.
In this optional embodiment, a processing manner is provided before the scene semantic recognition model corresponding to the target identity is issued to the target client, the currently determined scene semantic recognition model is compared with at least one scene semantic recognition model already issued to the target client, and if it is determined that the model already exists in the target client, the currently determined scene semantic recognition model is not issued any more.
Illustratively, suppose the scene semantic recognition model currently determined for the 2:00-3:00 time period and region 5 is scene semantic recognition model 1. The server judges whether this model already exists on the target client; if scene semantic recognition model 1 already exists on the target client with an application scene of the 12:00-13:00 time period and region 1, the currently determined scene semantic recognition model 1 is not issued again; instead, the application scene of the scene semantic recognition model 1 already existing on the target client is changed so that it matches both application scenes.
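The deduplication-and-merge behaviour can be sketched as follows; `issued` stands for the server's record of models already on the client, a bookkeeping structure assumed for illustration:

```python
def should_issue(model_id, scene, issued):
    """Return False and merge the application scene when the model is
    already on the client; otherwise record it and return True."""
    if model_id in issued:
        issued[model_id].add(scene)   # extend the existing model's scenes
        return False
    issued[model_id] = {scene}
    return True

issued = {"model_1": {("12:00-13:00", "region 1")}}
first = should_issue("model_1", ("2:00-3:00", "region 5"), issued)
# first is False: model_1 is not re-issued; its scenes now cover both
```

Only a `True` return would trigger an actual download to the target client; a `False` return results only in a scene update.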
According to the technical scheme of this embodiment, the server acquires the training data sent by the target client in real time, calculates the semantic features in at least one scene dimension according to the historical scene information and historical voice instructions contained in the training data, finally determines at least one scene semantic recognition model according to the semantic features in each scene dimension, and issues the scene semantic recognition model corresponding to the target identity to the target client. Because the server issues the scene semantic recognition model to the client, the client can perform semantic recognition directly with the locally stored scene semantic recognition model without sending the data to the server for recognition, which improves the efficiency of semantic recognition.
Example five
Fig. 5 is a schematic structural diagram of a semantic recognition device according to a fifth embodiment of the present invention, where the semantic recognition device includes: a scene information acquisition module 510, a recognition model determination module 520, and a speech information recognition module 530.
The scene information obtaining module 510 is configured to obtain, by using a client, a target identity of a user and target scene information matched with a target voice command according to the target voice command input by the user;
The recognition model determining module 520 is configured to determine, by the client, a target scene semantic recognition model that matches the target scene information from at least one scene semantic recognition model that matches the target identity;
and the voice information recognition module 530 is configured to recognize the target voice command according to the target scene semantic recognition model by the client.
According to the technical scheme of the embodiment of the invention, the client acquires the target identity of the user and the target scene information matched with the target voice instruction according to the target voice instruction input by the user, then determines a target scene semantic recognition model matched with the target scene information from at least one scene semantic recognition model matched with the target identity, and finally recognizes the target voice instruction according to the target scene semantic recognition model. This solves the prior-art problems that, because of the limited computing power and storage space of the client, a semantic recognition mode combining the client and the server is adopted, and the long time consumed in data transmission makes semantic recognition slow; the semantic recognition efficiency is thereby improved.
Optionally, the semantic recognition device further includes:
the training data acquisition module is used for collecting at least one historical voice instruction input by a user, the identity of the user and the historical scene information matched with the historical voice instruction as training data in real time before the client acquires the target identity of the user and the target scene information matched with the target voice instruction according to the target voice instruction input by the user;
the training data uploading module is used for uploading the training data to a server in real time so that the server generates at least one scene semantic recognition model corresponding to each identity mark according to the training data;
the identification model receiving module is used for receiving at least one scene semantic identification model which is issued by the server and corresponds to each identity mark.
Optionally, the semantic recognition device further includes:
and the scene information uploading module is used for uploading the target voice command of the user, the target identity and the target scene information to the server after the client identifies the target voice command according to the target scene semantic identification model, so that the server updates at least one scene semantic identification model matched with the target identity.
Optionally, the semantic recognition device further includes:
the voiceprint registration guiding module is used for guiding the target user to carry out voiceprint registration when the client detects that the target user logs in the client for the first time, and storing voiceprint information of the target user and the identity of the target user correspondingly;
the voiceprint information is used for identifying the identity of the user inputting the voice command.
The semantic recognition device provided by the embodiment of the invention can execute the semantic recognition method provided by the first embodiment and the second embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example six
Fig. 6 is a schematic structural diagram of a semantic recognition device according to a sixth embodiment of the present invention, where the semantic recognition device includes: a training data acquisition module 610, a semantic feature computation module 620, and an identification model determination module 630.
The training data acquisition module 610 is configured to acquire training data sent by a target client in real time, and group each training data according to an identity identifier included in the training data;
the semantic feature calculating module 620 is configured to calculate, when the server determines that the target packet corresponding to the target identity meets the model generating condition, semantic features in at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target packet;
The recognition model determining module 630 is configured to determine at least one scene semantic recognition model according to semantic features in each scene dimension by using the server, and send the scene semantic recognition model corresponding to the target identity to the target client.
According to the technical scheme of this embodiment, the server acquires the training data sent by the target client in real time, calculates the semantic features in at least one scene dimension according to the historical scene information and historical voice instructions contained in the training data, finally determines at least one scene semantic recognition model according to the semantic features in each scene dimension, and issues the scene semantic recognition model corresponding to the target identity to the target client. Because the server issues the scene semantic recognition model to the client, the client can perform semantic recognition directly with the locally stored scene semantic recognition model without sending the data to the server for recognition, which improves the efficiency of semantic recognition.
Optionally, the semantic feature calculating module 620 includes:
the first semantic feature set acquisition unit is used for dividing the time information in the historical scene information into a set number of time periods when the server determines that the training data in the target group reaches a set threshold value, and counting semantic categories corresponding to the historical voice instructions in each time period and the occurrence times of each semantic category to acquire a first semantic feature set in a time dimension;
The second semantic feature set acquisition unit is used for dividing the position information in the historical scene information into a set number of areas, counting semantic categories corresponding to the historical voice instructions in each area and the occurrence times of each semantic category, and acquiring a second semantic feature set in the space dimension;
the third semantic feature set obtaining unit is used for combining each time period with each region in pairs to obtain semantic categories corresponding to the historical voice instructions in the set time period and the set region and occurrence times of each semantic category, and obtaining a third semantic feature set;
the first semantic feature set, the second semantic feature set and the third semantic feature set are all composed of at least one semantic feature, and the semantic feature is composed of a set number of semantic categories.
Optionally, the identification model determining module 630 includes:
the semantic feature comparison unit is used for comparing the first semantic features in the first semantic feature set with the second semantic features in the second semantic feature set in pairs to determine whether semantic categories contained in the first semantic feature set and the second semantic feature set are the same;
the first recognition model determining unit is used for issuing a semantic recognition model corresponding to the first semantic feature to the target client as a scene semantic recognition model when the semantic categories contained in the first semantic feature and the second semantic feature are all the same;
The second recognition model determining unit is used for determining a scene semantic recognition model according to a set rule and transmitting the scene semantic recognition model to the target client when the semantic categories contained in the first semantic features and the second semantic features are not the same;
the application scene of the scene semantic recognition model is the time period corresponding to the first semantic feature and the region corresponding to the second semantic feature.
Optionally, the second recognition model determining unit is specifically configured to:
determining whether a third semantic feature matched with a time period corresponding to the first semantic feature and a region corresponding to a second semantic feature exists in the third semantic feature set;
if yes, the semantic recognition model corresponding to the third semantic features is used as a scene semantic recognition model to be issued to a target client;
if not, determining whether the first semantic features and the second semantic features are empty sets;
when either the first semantic feature or the second semantic feature is an empty set, the semantic recognition model corresponding to the non-empty semantic feature is issued to the target client as the scene semantic recognition model;
when both the first semantic feature and the second semantic feature are empty sets, a default semantic recognition model is issued to the target client as the scene semantic recognition model;
and when neither the first semantic feature nor the second semantic feature is an empty set, the semantic recognition model corresponding to the higher-priority semantic feature is issued to the target client as the scene semantic recognition model.
Optionally, the semantic recognition device further includes:
and the recognition model deduplication module is used for comparing the scene semantic recognition model with the scene semantic recognition model issued to the target client, and if the scene semantic recognition model is the same as the scene semantic recognition model issued to the target client, the current scene semantic recognition model is not issued any more.
The semantic recognition device provided by the embodiment of the invention can execute the semantic recognition methods provided by the third embodiment and the fourth embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution methods.
Example seven
Fig. 7 is a schematic structural diagram of an electronic device according to a seventh embodiment of the present invention. As shown in Fig. 7, the electronic device includes a processor 70 and a memory 71; the number of processors 70 in the device may be one or more, and one processor 70 is taken as an example in Fig. 7; the processor 70 and the memory 71 in the device may be connected by a bus or otherwise, with a bus connection taken as an example in Fig. 7.
The memory 71 is used as a computer readable storage medium for storing a software program, a computer executable program, and modules, such as program instructions/modules corresponding to a semantic recognition method in an embodiment of the present invention (for example, the scene information acquisition module 510, the recognition model determination module 520, and the voice information recognition module 530 in the semantic recognition device, or the training data acquisition module 610, the semantic feature calculation module 620, and the recognition model determination module 630 in the semantic recognition device). The processor 70 executes various functional applications of the device and data processing, i.e., implements the above-described semantic recognition method, by running software programs, instructions, and modules stored in the memory 71.
The method comprises the following steps:
the client acquires a target identity of a user and target scene information matched with a target voice instruction according to the target voice instruction input by the user;
the client determines a target scene semantic recognition model matched with the target scene information in at least one scene semantic recognition model matched with the target identity;
and the client identifies the target voice instruction according to the target scene semantic identification model.
The method further comprises the steps of:
the method comprises the steps that a server acquires training data sent by a target client in real time, and groups the training data according to identity marks included in the training data;
when the server determines that a target group corresponding to the target identity meets a model generation condition, calculating semantic features under at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target group;
and the server determines at least one scene semantic identification model according to the semantic features in each scene dimension, and transmits the scene semantic identification model corresponding to the target identity to the target client.
The memory 71 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 71 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 71 may further include memory remotely located relative to processor 70, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Example eight
An eighth embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a computer processor, performs a semantic recognition method comprising the following steps:
the client acquires a target identity of a user and target scene information matched with a target voice instruction according to the target voice instruction input by the user;
The client determines a target scene semantic recognition model matched with the target scene information in at least one scene semantic recognition model matched with the target identity;
and the client identifies the target voice instruction according to the target scene semantic identification model.
The method further comprises the steps of:
the method comprises the steps that a server acquires training data sent by a target client in real time, and groups the training data according to identity marks included in the training data;
when the server determines that a target group corresponding to the target identity meets a model generation condition, calculating semantic features under at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target group;
and the server determines at least one scene semantic recognition model according to the semantic features in each scene dimension, and issues the scene semantic recognition model corresponding to the target identity to the target client.
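The server-side grouping and feature statistics described above might look as follows. This is a minimal sketch under stated assumptions: the sample dictionary keys, the threshold constant, and the assumption that samples are already bucketed into time periods and regions are illustrative, not taken from the patent.

```python
from collections import Counter, defaultdict

# Assumed value for the patent's "set threshold" of training samples per group.
MODEL_GENERATION_THRESHOLD = 100

def group_by_identity(training_data):
    # Group training data according to the identity included in each sample.
    groups = defaultdict(list)
    for sample in training_data:
        groups[sample["identity"]].append(sample)
    return groups

def compute_semantic_features(group):
    """Count semantic categories per time period (first semantic feature set),
    per region (second set), and per (time period, region) pair (third set)."""
    first = defaultdict(Counter)   # time dimension
    second = defaultdict(Counter)  # space dimension
    third = defaultdict(Counter)   # joint time/space dimension
    for sample in group:
        cat = sample["semantic_category"]
        first[sample["time_period"]][cat] += 1
        second[sample["region"]][cat] += 1
        third[(sample["time_period"], sample["region"])][cat] += 1
    return first, second, third
```

In a full implementation, `compute_semantic_features` would only be invoked for a group once its sample count reaches `MODEL_GENERATION_THRESHOLD`, mirroring the model generation condition.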
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by means of software plus necessary general-purpose hardware, or by hardware alone, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, or the part thereof contributing to the prior art, may be embodied in the form of a software product. The software product may be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
It should be noted that, in the above embodiment of the semantic recognition device, the units and modules included are divided only according to functional logic, and the division is not limited thereto as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions may be made without departing from the protection scope of the invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to these embodiments and may include many other equivalent embodiments without departing from the concept of the invention; the scope of the invention is determined by the appended claims.
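The model-selection rule recited in the claims (comparing the first and second semantic feature sets, falling back to a joint feature, and handling empty sets) can be sketched as follows. Representing a semantic feature as a set of category names, the `models` dictionary keys, and the tie-breaking priority are assumptions made for illustration; the patent leaves these choices to the implementation.

```python
def select_scene_model(first_feature, second_feature, third_feature, models, priority="time"):
    """Choose which semantic recognition model the server issues to the client.

    first_feature / second_feature: sets of semantic categories for the matched
    time period and region (either may be empty); third_feature: the category
    set for the (time period, region) pair, or None if no joint feature exists;
    models: dict with illustrative keys "time", "space", "joint", "default".
    """
    # Same semantic categories in both dimensions -> model of the first feature.
    if first_feature and first_feature == second_feature:
        return models["time"]
    # Otherwise prefer a joint feature matched on (time period, region).
    if third_feature:
        return models["joint"]
    # Empty-set handling: fall back to the non-empty dimension, or the default.
    if not first_feature and not second_feature:
        return models["default"]
    if not first_feature:
        return models["space"]
    if not second_feature:
        return models["time"]
    # Both non-empty but different: the higher-priority dimension wins.
    return models[priority]
```

The final branch corresponds to issuing "the semantic recognition model corresponding to the semantic features with higher priority"; the patent does not specify how that priority is assigned.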

Claims (11)

1. A semantic recognition method, comprising:
the client acquires a target identity of a user and target scene information matched with a target voice instruction according to the target voice instruction input by the user;
The client determines a target scene semantic recognition model matched with the target scene information in at least one scene semantic recognition model matched with the target identity;
the client identifies the target voice instruction according to the target scene semantic identification model;
the scene semantic recognition models are semantic recognition models matched with specific scenes, each scene semantic recognition model corresponds to at least one application scene, and the application scene comprises time information and position information;
wherein the at least one scene semantic recognition model is determined according to semantic features in each scene dimension;
the semantic features under the at least one scene dimension are calculated according to historical scene information and historical voice instructions included in each training data in the target group when the target group corresponding to the target identity is determined to meet the model generation condition;
when determining that the target group corresponding to the target identity meets the model generation condition, calculating semantic features under at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target group, wherein the semantic features comprise:
When the training data in the target group reach a set threshold value, dividing the time information in the historical scene information into a set number of time periods, and counting semantic categories corresponding to the historical voice instructions in each time period and the occurrence times of each semantic category to obtain a first semantic feature set in a time dimension;
dividing the position information in the historical scene information into a set number of areas, counting semantic categories corresponding to the historical voice instructions in each area and the occurrence times of each semantic category, and obtaining a second semantic feature set in the space dimension;
combining each time period with each region in pairs to obtain semantic categories corresponding to historical voice instructions in a set time period and a set region and occurrence times of each semantic category, and obtaining a third semantic feature set;
the first semantic feature set, the second semantic feature set and the third semantic feature set are all composed of at least one semantic feature, and the semantic feature is composed of a set number of semantic categories;
the at least one scene semantic recognition model is determined according to semantic features in each scene dimension, comprising:
Comparing the first semantic features in the first semantic feature set with the second semantic features in the second semantic feature set in pairs to determine whether semantic categories contained in the first semantic feature set and the second semantic feature set are the same;
if yes, the semantic recognition model corresponding to the first semantic features is used as a scene semantic recognition model to be issued to the target client;
if not, determining a scene semantic recognition model according to the set rule, and issuing the scene semantic recognition model to a target client;
the application scene of the scene semantic recognition model is the time period corresponding to the first semantic feature and the region corresponding to the second semantic feature.
2. The method of claim 1, further comprising, before the client obtains the target identity of the user and the target scene information matched with the target voice command according to the target voice command input by the user:
the client collects at least one historical voice instruction input by a user, the identity of the user and historical scene information matched with the historical voice instruction in real time as training data;
uploading the training data to a server in real time, so that the server generates at least one scene semantic recognition model corresponding to each identity according to the training data;
And receiving at least one scene semantic recognition model which is issued by the server and corresponds to each identity mark.
3. The method according to claim 2, further comprising, after the client performs recognition processing on the target voice command according to the target scene semantic recognition model:
and the client uploads the target voice instruction, the target identity and the target scene information of the user to the server so that the server updates at least one scene semantic recognition model matched with the target identity.
4. A method according to any one of claims 1-3, further comprising:
when the client detects that the target user logs in the client for the first time, guiding the target user to perform voiceprint registration, and storing voiceprint information of the target user and the identity of the target user correspondingly;
the voiceprint information is used for identifying the identity of the user inputting the voice command.
5. A semantic recognition method, comprising:
the method comprises the steps that a server acquires training data sent by a target client in real time, and groups the training data according to identity marks included in the training data;
When the server determines that a target group corresponding to the target identity meets a model generation condition, calculating semantic features under at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target group;
the server determines at least one scene semantic recognition model according to semantic features in each scene dimension, and issues the scene semantic recognition model corresponding to the target identity to a target client;
the scene semantic recognition models are semantic recognition models matched with specific scenes, each scene semantic recognition model corresponds to at least one application scene, and the application scene comprises time information and position information;
when the server determines that the target group corresponding to the target identity meets the model generation condition, according to historical scene information and historical voice instructions included in each training data in the target group, calculating semantic features under at least one scene dimension, wherein the semantic features comprise:
when the server determines that training data in the target group reaches a set threshold, dividing time information in the historical scene information into a set number of time periods, and counting semantic categories corresponding to historical voice instructions in each time period and occurrence times of each semantic category to obtain a first semantic feature set in a time dimension;
Dividing the position information in the historical scene information into a set number of areas, counting semantic categories corresponding to the historical voice instructions in each area and the occurrence times of each semantic category, and obtaining a second semantic feature set in the space dimension;
combining each time period with each region in pairs to obtain semantic categories corresponding to historical voice instructions in a set time period and a set region and occurrence times of each semantic category, and obtaining a third semantic feature set;
the first semantic feature set, the second semantic feature set and the third semantic feature set are all composed of at least one semantic feature, and the semantic feature is composed of a set number of semantic categories;
the server determines at least one scene semantic recognition model according to semantic features in each scene dimension, and issues the scene semantic recognition model corresponding to the target identity to a target client, which comprises the following steps:
comparing the first semantic features in the first semantic feature set with the second semantic features in the second semantic feature set in pairs to determine whether semantic categories contained in the first semantic feature set and the second semantic feature set are the same;
if yes, the semantic recognition model corresponding to the first semantic features is used as a scene semantic recognition model to be issued to the target client;
If not, determining a scene semantic recognition model according to the set rule, and issuing the scene semantic recognition model to a target client;
the application scene of the scene semantic recognition model is the time period corresponding to the first semantic feature and the region corresponding to the second semantic feature.
6. The method of claim 5, wherein determining a scene semantic recognition model based on the set rules and issuing to the target client comprises:
determining whether a third semantic feature matched with a time period corresponding to the first semantic feature and a region corresponding to a second semantic feature exists in the third semantic feature set;
if yes, the semantic recognition model corresponding to the third semantic features is used as a scene semantic recognition model to be issued to a target client;
if not, determining whether the first semantic features and the second semantic features are empty sets;
when the first semantic features or the second semantic features are empty sets, a semantic recognition model corresponding to the semantic features which are not empty sets is used as a scene semantic recognition model to be issued to a target client;
when the first semantic features and the second semantic features are both empty sets, a default semantic recognition model is issued to the target client as the scene semantic recognition model;
And when the first semantic features and the second semantic features are not empty sets, issuing a semantic recognition model corresponding to the semantic features with higher priority to a target client as a scene semantic recognition model.
7. A semantic recognition apparatus, comprising:
the scene information acquisition module is used for acquiring a target identity of a user and target scene information matched with the target voice instruction according to the target voice instruction input by the user by the client;
the identification model determining module is used for determining a target scene semantic identification model matched with the target scene information in at least one scene semantic identification model matched with the target identity by the client;
the voice information recognition module is used for the client to recognize the target voice instruction according to the target scene semantic recognition model;
the scene semantic recognition models are semantic recognition models matched with specific scenes, each scene semantic recognition model corresponds to at least one application scene, and the application scene comprises time information and position information;
wherein the at least one scene semantic recognition model is determined according to semantic features in each scene dimension;
The semantic features under the at least one scene dimension are calculated according to historical scene information and historical voice instructions included in each training data in the target group when the target group corresponding to the target identity is determined to meet the model generation condition;
when determining that the target group corresponding to the target identity meets the model generation condition, calculating semantic features under at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target group, wherein the semantic features comprise:
when the training data in the target group reach a set threshold value, dividing the time information in the historical scene information into a set number of time periods, and counting semantic categories corresponding to the historical voice instructions in each time period and the occurrence times of each semantic category to obtain a first semantic feature set in a time dimension;
dividing the position information in the historical scene information into a set number of areas, counting semantic categories corresponding to the historical voice instructions in each area and the occurrence times of each semantic category, and obtaining a second semantic feature set in the space dimension;
combining each time period with each region in pairs to obtain semantic categories corresponding to historical voice instructions in a set time period and a set region and occurrence times of each semantic category, and obtaining a third semantic feature set;
The first semantic feature set, the second semantic feature set and the third semantic feature set are all composed of at least one semantic feature, and the semantic feature is composed of a set number of semantic categories; the at least one scene semantic recognition model is determined according to semantic features in each scene dimension, comprising:
comparing the first semantic features in the first semantic feature set with the second semantic features in the second semantic feature set in pairs to determine whether semantic categories contained in the first semantic feature set and the second semantic feature set are the same;
if yes, the semantic recognition model corresponding to the first semantic features is used as a scene semantic recognition model to be issued to the target client;
if not, determining a scene semantic recognition model according to the set rule, and issuing the scene semantic recognition model to a target client;
the application scene of the scene semantic recognition model is the time period corresponding to the first semantic feature and the region corresponding to the second semantic feature.
8. The apparatus of claim 7, wherein the semantic recognition apparatus further comprises:
the training data acquisition module is used for collecting at least one historical voice instruction input by a user, the identity of the user and the historical scene information matched with the historical voice instruction as training data in real time before the client acquires the target identity of the user and the target scene information matched with the target voice instruction according to the target voice instruction input by the user;
The training data uploading module is used for uploading the training data to a server in real time so that the server generates at least one scene semantic recognition model corresponding to each identity mark according to the training data;
the identification model receiving module is used for receiving at least one scene semantic identification model which is issued by the server and corresponds to each identity mark.
9. A semantic recognition apparatus, comprising:
the training data acquisition module is used for acquiring training data sent by the target client in real time by the server and grouping the training data according to the identity mark included in the training data;
the semantic feature calculation module is used for calculating semantic features under at least one scene dimension according to historical scene information and historical voice instructions included in each training data in the target group when the server determines that the target group corresponding to the target identity meets the model generation condition;
the identification model determining module is used for the server to determine at least one scene semantic recognition model according to semantic features in each scene dimension and to issue the scene semantic recognition model corresponding to the target identity to the target client;
The scene semantic recognition models are semantic recognition models matched with specific scenes, each scene semantic recognition model corresponds to at least one application scene, and the application scene comprises time information and position information;
the semantic feature calculation module comprises:
the first semantic feature set acquisition unit is used for dividing the time information in the historical scene information into a set number of time periods when the server determines that the training data in the target group reaches a set threshold value, and counting semantic categories corresponding to the historical voice instructions in each time period and the occurrence times of each semantic category to acquire a first semantic feature set in a time dimension;
the second semantic feature set acquisition unit is used for dividing the position information in the historical scene information into a set number of areas, counting semantic categories corresponding to the historical voice instructions in each area and the occurrence times of each semantic category, and acquiring a second semantic feature set in the space dimension;
the third semantic feature set obtaining unit is used for combining each time period with each region in pairs to obtain semantic categories corresponding to the historical voice instructions in the set time period and the set region and occurrence times of each semantic category, and obtaining a third semantic feature set;
The first semantic feature set, the second semantic feature set and the third semantic feature set are all composed of at least one semantic feature, and the semantic feature is composed of a set number of semantic categories;
an identification model determination module comprising:
the semantic feature comparison unit is used for comparing the first semantic features in the first semantic feature set with the second semantic features in the second semantic feature set in pairs to determine whether semantic categories contained in the first semantic feature set and the second semantic feature set are the same;
the first recognition model determining unit is used for issuing a semantic recognition model corresponding to the first semantic feature to the target client as a scene semantic recognition model when the semantic categories contained in the first semantic feature and the second semantic feature are all the same;
the second recognition model determining unit is used for determining a scene semantic recognition model according to a set rule and transmitting the scene semantic recognition model to the target client when the semantic categories contained in the first semantic features and the second semantic features are not the same;
the application scene of the scene semantic recognition model is the time period corresponding to the first semantic feature and the region corresponding to the second semantic feature.
10. The apparatus according to claim 9, wherein the second recognition model determining unit is specifically configured to:
Determining whether a third semantic feature matched with a time period corresponding to the first semantic feature and a region corresponding to a second semantic feature exists in the third semantic feature set;
if yes, the semantic recognition model corresponding to the third semantic features is used as a scene semantic recognition model to be issued to a target client;
if not, determining whether the first semantic features and the second semantic features are empty sets;
when the first semantic features or the second semantic features are empty sets, a semantic recognition model corresponding to the semantic features which are not empty sets is used as a scene semantic recognition model to be issued to a target client;
when the first semantic features and the second semantic features are both empty sets, a default semantic recognition model is issued to the target client as the scene semantic recognition model;
and when the first semantic features and the second semantic features are not empty sets, issuing a semantic recognition model corresponding to the semantic features with higher priority to a target client as a scene semantic recognition model.
11. An electronic device, the device comprising:
one or more processors;
a memory for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the semantic recognition method of any of claims 1-4 or claims 5-6.
CN202010232031.5A 2020-03-27 2020-03-27 Semantic recognition method, device and equipment Active CN111428512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010232031.5A CN111428512B (en) 2020-03-27 2020-03-27 Semantic recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN111428512A CN111428512A (en) 2020-07-17
CN111428512B true CN111428512B (en) 2023-12-12

Family

ID=71551682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010232031.5A Active CN111428512B (en) 2020-03-27 2020-03-27 Semantic recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN111428512B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116910A (en) * 2020-10-30 2020-12-22 珠海格力电器股份有限公司 Voice instruction recognition method and device, storage medium and electronic device
CN112614490B (en) * 2020-12-09 2024-04-16 北京罗克维尔斯科技有限公司 Method, device, medium, equipment, system and vehicle for generating voice instruction
CN112786055A (en) * 2020-12-25 2021-05-11 北京百度网讯科技有限公司 Resource mounting method, device, equipment, storage medium and computer program product
CN112908328B (en) * 2021-02-02 2023-07-07 安通恩创信息技术(北京)有限公司 Device control method, system, computer device and storage medium
CN114124597B (en) * 2021-10-28 2023-06-16 青岛海尔科技有限公司 Control method, equipment and system of Internet of things equipment
CN115346530B (en) * 2022-10-19 2023-01-13 亿咖通(北京)科技有限公司 Voice control method, device, equipment, medium, system and vehicle

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004325688A (en) * 2003-04-23 2004-11-18 Toyota Motor Corp Speech recognition system
US7890136B1 (en) * 2003-09-26 2011-02-15 Iwao Fujisaki Communication device
CN102915731A (en) * 2012-10-10 2013-02-06 百度在线网络技术(北京)有限公司 Method and device for recognizing personalized speeches
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
CN106297782A (en) * 2016-07-28 2017-01-04 北京智能管家科技有限公司 A kind of man-machine interaction method and system
CN106356057A (en) * 2016-08-24 2017-01-25 安徽咪鼠科技有限公司 Speech recognition system based on semantic understanding of computer application scenario
CN107316635A (en) * 2017-05-19 2017-11-03 科大讯飞股份有限公司 Audio recognition method and device, storage medium, electronic equipment
CN107481720A (en) * 2017-06-30 2017-12-15 百度在线网络技术(北京)有限公司 A kind of explicit method for recognizing sound-groove and device
CN107644642A (en) * 2017-09-20 2018-01-30 广东欧珀移动通信有限公司 Method for recognizing semantics, device, storage medium and electronic equipment
CN107799116A (en) * 2016-08-31 2018-03-13 科大讯飞股份有限公司 More wheel interacting parallel semantic understanding method and apparatus
CN108962245A (en) * 2018-07-06 2018-12-07 奇瑞汽车股份有限公司 Control method, device and the computer readable storage medium of onboard system
CN109712610A (en) * 2019-03-12 2019-05-03 百度在线网络技术(北京)有限公司 The method and apparatus of voice for identification
CN110364146A (en) * 2019-08-23 2019-10-22 腾讯科技(深圳)有限公司 Audio recognition method, device, speech recognition apparatus and storage medium
WO2019233219A1 (en) * 2018-06-07 2019-12-12 腾讯科技(深圳)有限公司 Dialogue state determining method and device, dialogue system, computer device, and storage medium
CN110634472A (en) * 2018-06-21 2019-12-31 中兴通讯股份有限公司 Voice recognition method, server and computer readable storage medium
CN110837543A (en) * 2019-10-14 2020-02-25 深圳和而泰家居在线网络科技有限公司 Conversation interaction method, device and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106557461B (en) * 2016-10-31 2019-03-12 百度在线网络技术(北京)有限公司 Semantic analyzing and processing method and device based on artificial intelligence
KR102445382B1 (en) * 2017-07-10 2022-09-20 삼성전자주식회사 Voice processing method and system supporting the same

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Naoyuki Kubota; Yuichiro Toda. Multimodal Communication for Human-Friendly Robot Partners in Informationally Structured Space. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews). 2012, full text. *
Shu Zhenhan; Xu Liang. A Modeling Method for Chinese Automatic Speech Recognition Based on Knowledge Distillation. Modern Computer. 2020, full text. *
Application of Speech Recognition Technology in an Electronic Shelf Label System; Ding Lei; Jiang Dongguo; Wang Zhitao; Computer Measurement & Control (No. 10); full text *

Also Published As

Publication number Publication date
CN111428512A (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN111428512B (en) Semantic recognition method, device and equipment
CN107240398B (en) Intelligent voice interaction method and device
CN110648553B (en) Site reminding method, electronic equipment and computer readable storage medium
CN106601237B (en) Interactive voice response system and voice recognition method thereof
CN106164869B (en) Hybrid client/server architecture for parallel processing
CN108346430A (en) Conversational system, the vehicle with conversational system and dialog process method
CN107316643A (en) Voice interactive method and device
CN102163198B (en) A method and a system for providing new or popular terms
CN109791767A (en) System and method for speech recognition
CN107665708A (en) Intelligent sound exchange method and system
CN107665704B (en) Voice instruction detection model construction method, detection method and system, and man-machine interaction method and equipment
CN111737444A (en) Dialog generation method and device and electronic equipment
CN109032381B (en) Input method and device based on context, storage medium and terminal
CN111145721A (en) Personalized prompt language generation method, device and equipment
CN110503948A (en) Conversational system and dialog process method
CN110879837A (en) Information processing method and device
CN113806503A (en) Dialog fusion method, device and equipment
CN108306813B (en) Session message processing method, server and client
CN110503947A (en) Conversational system, the vehicle including it and dialog process method
CN113012687B (en) Information interaction method and device and electronic equipment
CN110533941A (en) Vehicle communication method, device, electronic equipment and computer media
CN106372203A (en) Information response method and device for smart terminal and smart terminal
CN110795547B (en) Text recognition method and related product
CN106844734B (en) Method for automatically generating session reply content
CN110471410A (en) Intelligent vehicle voice assisting navigation and safety prompting system and method based on ROS

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant