CN115691483A - Semantic understanding equipment and method and storage medium - Google Patents


Info

Publication number: CN115691483A
Application number: CN202110787623.8A
Authority: CN (China)
Prior art keywords: target, semantic understanding, state information, user, information
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 孟卫明, 何晨迪, 王月岭
Current assignee: Hisense Group Holding Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Hisense Group Holding Co Ltd
Application filed by: Hisense Group Holding Co Ltd
Priority: CN202110787623.8A
Classification: Information Retrieval, Db Structures And Fs Structures Therefor


Abstract

The application provides a semantic understanding device, a semantic understanding method, and a storage medium, applied to semantic understanding services. The semantic understanding device includes: a communication interface, configured to acquire, after a voice interaction instruction triggered by a target user is received, target state information of the target user collected by at least one data acquisition device; and a processor, configured to obtain, from target semantic record information corresponding to the target user, a target semantic understanding result matching the target state information, and to perform semantic understanding on the voice interaction instruction based on a semantic understanding model corresponding to the target semantic understanding result, where the target semantic record information contains the correspondence between each piece of historical state information of the target user and its semantic understanding result. Because the target semantic understanding result of the target user is determined in advance from the target state information, and the semantic understanding models are screened based on that result, fewer semantic understanding models are invoked, computing resources and semantic understanding time are saved, and the accuracy of semantic understanding is improved.

Description

Semantic understanding equipment and method and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a semantic understanding apparatus, method, and storage medium.
Background
In current semantic understanding methods, a computation must be performed for every user requirement or every scene, and a comprehensive decision is then made by combining the context information of the user's speech.
Because a computation is performed for every user requirement or every scene, the amount of data to be computed is large, computing resources are wasted, and semantic understanding takes a long time.
Disclosure of Invention
The application provides a semantic understanding device, a semantic understanding method, and a storage medium, to improve the accuracy of semantic understanding and to save computing resources and semantic understanding time.
In a first aspect, an embodiment of the application provides a semantic understanding device, where the device includes a communication interface and a processor, wherein:
the communication interface is configured to acquire, after a voice interaction instruction triggered by a target user is received, target state information of the target user collected by at least one data acquisition device;
the processor is configured to obtain, from target semantic record information corresponding to the target user, a target semantic understanding result matching the target state information, and to perform semantic understanding on the voice interaction instruction based on a semantic understanding model corresponding to the target semantic understanding result, where the target semantic record information contains the correspondence between each piece of historical state information of the target user and its semantic understanding result.
In a second aspect, an embodiment of the application provides a semantic understanding method, where the method includes:
after receiving a voice interaction instruction triggered by a target user, acquiring target state information of the target user collected by at least one data acquisition device;
obtaining, from target semantic record information corresponding to the target user, a target semantic understanding result matching the target state information, where the target semantic record information contains the correspondence between each piece of historical state information of the target user and its semantic understanding result; and
performing semantic understanding on the voice interaction instruction based on a semantic understanding model corresponding to the target semantic understanding result.
In a third aspect, an embodiment of the application provides a semantic understanding apparatus, where the apparatus includes:
a first acquisition module, configured to acquire, after a voice interaction instruction triggered by a target user is received, target state information of the target user collected by at least one data acquisition device;
a second acquisition module, configured to obtain, from target semantic record information corresponding to the target user, a target semantic understanding result matching the target state information, where the target semantic record information contains the correspondence between each piece of historical state information of the target user and its semantic understanding result; and
a semantic understanding module, configured to perform semantic understanding on the voice interaction instruction based on a semantic understanding model corresponding to the target semantic understanding result.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions, which when executed by a processor, implement the method steps for semantic understanding provided by embodiments of the present application.
The embodiment of the application has the following beneficial effects:
the embodiment of the application provides semantic understanding equipment, a semantic understanding method and a storage medium, which are applied to semantic understanding service in the field of voice interaction; when semantic understanding is carried out through semantic understanding equipment, firstly, target state information of a target user, which is acquired by at least one data acquisition equipment, is acquired; then according to the target state information, obtaining a target semantic understanding result matched with the target state information in target semantic record information corresponding to a target user; and finally, performing semantic understanding on the voice interaction instruction based on a semantic understanding model corresponding to the target semantic understanding result. The semantic understanding model is screened based on the target state information, the semantic understanding model under each scene and/or requirement does not need to be called, only the semantic understanding model matched with the target state information is called, the calling of the semantic understanding model is reduced, the computational resources and the semantic understanding time are saved, and meanwhile the accuracy of semantic understanding is improved.
Additional features and advantages of the application will be set forth in the description that follows, and in part will be apparent from the description or may be learned by practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description, the claims, and the appended drawings.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the application; other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
fig. 2 is a structural diagram of a semantic understanding apparatus according to an embodiment of the present application;
FIG. 3 is a flow chart of a semantic understanding method according to an embodiment of the present application;
fig. 4 is a flowchart of a user feature information binding method according to an embodiment of the present application;
fig. 5 is a diagram of a semantic understanding apparatus according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the application clearer, the technical solutions in the embodiments of the application are described in detail and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort fall within the protection scope of the application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein.
The following briefly introduces the design concept of the embodiments of the present application.
The embodiments of the application relate to semantic understanding services in the field of voice interaction: performing semantic understanding on a voice interaction instruction triggered by a user, identifying the user's intention, and providing the corresponding service.
In the related art, during semantic understanding, a computation is performed for every user requirement and/or every application scene; then, combining the context information of the user's voice interaction instruction, a comprehensive decision is made over all the obtained semantic understanding results to screen out one semantic understanding result and identify the user's intention.
Because the comprehensive decision over the semantic understanding results relies only on the context information of the user's voice interaction instruction, the screened semantic understanding result is not accurate enough; and because every user requirement and/or every application scene is computed during semantic understanding, the computation workload is large and time-consuming.
In view of this, embodiments of the present application provide a semantic understanding apparatus and method, so as to improve accuracy of semantic understanding, and save computational resources and time for semantic understanding.
The embodiments of the application provide a semantic understanding device that is deployed in the user's home, has local storage and semantic understanding capabilities, and can interact with at least one data acquisition device to obtain the target state information of the target user collected by that device.
The semantic understanding device locally stores semantic record information for each user; the semantic record information contains the correspondence between each user's historical state information and the associated semantic understanding results.
When performing semantic understanding, the semantic understanding device acquires, after receiving a voice interaction instruction triggered by the target user, the target state information of the target user collected by at least one data acquisition device; it matches the target state information against each piece of historical state information in the target semantic record information corresponding to the target user, and determines the target semantic understanding result corresponding to the successfully matched historical state information; it then performs semantic understanding on the voice interaction instruction based on the semantic understanding model corresponding to the target semantic understanding result, identifies the user's intention, and provides the corresponding service.
In this process, a target semantic understanding result, which characterizes the target user's requirement and/or scene under the target state information, is predicted based on the target user's target state information and the target user's historical state information, and the semantic understanding model corresponding to that result is determined; the voice interaction instruction is then semantically understood by the determined model. No computation is needed for every application scene and/or requirement, so computing resources and semantic understanding time are saved; meanwhile, because the decision incorporates the requirement or the scene, semantic understanding accuracy is improved.
After introducing the design concept of the embodiments of the application, application scenarios to which the technical solutions of the embodiments can be applied are briefly described below. It should be noted that the application scenarios described below are only used to illustrate the embodiments of the application and do not constitute a limitation. In specific implementation, the technical solutions provided by the embodiments of the application can be applied flexibly according to actual needs.
Referring to fig. 1, fig. 1 is a schematic diagram of a possible application scenario provided in an embodiment of the present application, where the application scenario includes a data acquisition device 10, a home router 20, and a semantic understanding device 30.
In one possible implementation, IP addresses are allocated to the data acquisition device 10 and the semantic understanding device 30 through the home router 20, so that the data acquisition device 10 and the semantic understanding device 30 are connected through a network, and data acquired by the data acquisition device 10 is forwarded to the semantic understanding device 30 through the home router 20.
The data acquisition device 10 includes, but is not limited to:
the system comprises a video data acquisition device 10-1 for acquiring video data, wherein the video data acquisition device 10-1 can be a privacy-enhanced camera deployed in a user family, an intelligent device with an audio and video acquisition function and the like;
the audio data acquisition equipment 10-2 is used for acquiring audio data, and the audio data acquisition equipment 10-2 can be an intelligent sound box in a set product of the semantic understanding equipment 30, which is also called a set intelligent sound box, intelligent equipment with an audio and video acquisition function and the like;
the sensor 10-3 is used for identifying user position information, user contour characteristic information and user heartbeat characteristic information, and the sensor 10-3 can be a household sensor adopting millimeter wave radar technology and is also called a household millimeter wave radar sensor.
It should be noted that, compared with a conventional camera, the privacy-enhanced camera only transmits video data inside the local area network and encrypts the video data with a preset encryption algorithm, preventing the video data from being stolen and leaking private information. For example, the privacy-enhanced camera transmits the encrypted video data to the semantic understanding device 30 through the home router 20, and the semantic understanding device 30 decrypts it with the preset decryption algorithm to obtain the video data;
the intelligent device with the audio and video acquisition function can be an intelligent household appliance with audio and video acquisition capability in a family, such as a social television with both video and audio acquisition capability; the mobile terminal can be an intelligent handheld terminal device, such as a mobile phone, a tablet computer and the like; and the system also can be miniaturized intelligent panel equipment with certain audio and video acquisition capability in a household. The intelligent device with the audio and video acquisition function is communicated with the semantic understanding device 30, the acquired audio data and video data are transmitted to the semantic understanding device 30 through the home router 20, and the semantic understanding device 30 processes the audio data and the video data to realize home linkage or intelligent service of the intelligent device;
compared with a common intelligent sound box, the set of intelligent sound box has the advantages that audio data collected by the set of intelligent sound box is not directly sent out to a cloud platform for voice recognition and semantic understanding, the collected audio data is transmitted to the semantic understanding equipment 30 through the home router 20, the semantic understanding equipment 30 performs voice recognition and semantic understanding, and service is provided for users;
the household millimeter wave radar sensor is applied to a scene where a user is sensitive to the video data acquisition equipment 10-1, such as a toilet, and can be used for identifying user position information and acquiring user characteristic information such as user contour characteristic information and user heartbeat characteristic information.
The semantic understanding device 30 may be a home edge computing server, which serves as the home's data storage center, edge computing center, privacy protection center, and friendly interaction center.
In the embodiments of the application, a data storage service, a data analysis service, and a semantic understanding service are deployed on the home edge computing server, where:
the data storage service: the home edge computing server recognizes the collected audio and video data, as well as the user location information and user feature information collected by the household millimeter wave radar sensor, and locally stores the user state information obtained from this recognition;
the data analysis service: the home edge computing server analyzes the user's scene and/or requirement based on the acquired user state information and historical state information, and determines the semantic understanding model for the analyzed scene and/or requirement;
the semantic understanding service: after the user triggers a voice interaction instruction, the home edge computing server invokes the semantic understanding model for the analyzed scene and/or requirement to perform semantic understanding on the speech recognition content of the voice interaction instruction, improving the accuracy of understanding the user's intention and reducing the occupation of algorithm resources.
Referring to fig. 2, fig. 2 shows a schematic structural diagram of the semantic understanding apparatus 30 in the embodiment of the present application.
The following specifically describes the embodiments of the application, taking the semantic understanding device 30 as an example. It should be understood that the semantic understanding device 30 shown in fig. 2 is only one example; the semantic understanding device 30 may have more or fewer components than shown in fig. 2, may combine two or more components, or may have a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
A hardware configuration block diagram of the semantic understanding apparatus 30 according to an exemplary embodiment is exemplarily shown in fig. 2. As shown in fig. 2, the semantic understanding apparatus 30 includes: memory 310, communication interface 320, processor 330, power source 340, bus 350, and the like.
Memory 310 may be used to store software programs and data. The processor 330 performs the various functions of the semantic understanding device 30 and processes data by running the software programs and data stored in the memory 310.
Memory 310 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 3101 and/or cache memory 3102, and may further include Read Only Memory (ROM) 3103.
Memory 310 may also include a program/utility 3105 having a set (at least one) of program modules 3104, such program modules 3104 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
In the embodiments of the application, the memory 310 may store semantic record information, which includes but is not limited to: the time periods corresponding to each user's historical state information, the correspondence between each user's historical state information and semantic understanding results, the correspondence between each user's historical state information and speech recognition content, the number of records of each semantic understanding result corresponding to each piece of historical state information, and the confidence of each semantic understanding result corresponding to each piece of historical state information.
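As a concrete illustration of how one such record might be laid out, a minimal Python sketch follows; the class name, field names, and example values are hypothetical choices made for this sketch, not identifiers from the patent.

```python
# Hypothetical layout of one entry of the locally stored semantic record
# information (names are illustrative, not from the patent).
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class SemanticRecord:
    user_id: str                       # bound user identity
    state: Dict[str, str]              # historical state info, e.g. {"emotion": "unhappy"}
    result: str                        # semantic understanding result, e.g. "play songs"
    time_period: Optional[str] = None  # e.g. "22:00-23:00"
    record_count: int = 0              # times this result was recorded for this state
    confidence: float = 0.0            # confidence of this result for this state
    speech_content: Optional[str] = None  # associated speech recognition content
```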
The communication interface 320 is used to receive the user data transmitted by each data acquisition device; the user data includes but is not limited to: audio data and video data collected by the audio and video acquisition devices, and the user feature information (such as user contour and heartbeat) and user location information collected by the sensor;
the semantic understanding device 30 also sends a control instruction to the controlled device through the communication interface 320, or sends a request message to the extranet service through the communication interface 320;
for example, the semantic understanding device 30 determines, based on the semantic understanding model corresponding to the target semantic understanding result, that the target user intends to trigger the voice interaction instruction to control a certain controlled device in the home scene, and at this time, the semantic understanding device 30 sends a corresponding control instruction to the controlled device through the communication interface 320;
for another example, the semantic understanding device 30 determines that the intention of the target user to trigger the voice interaction instruction is to request the network media resource based on the semantic understanding model corresponding to the target semantic understanding result, at this time, the semantic understanding device 30 sends request information to the extranet server through the communication interface 320 to request the corresponding network media resource.
The processor 330 is a control center of the semantic understanding apparatus 30, connects various parts of the entire terminal using various interfaces and lines, and performs any method step of semantic understanding provided in the embodiment of the present application by running or executing a software program stored in the memory 310 and calling data stored in the memory 310.
In some embodiments, processor 330 may include one or more processing units; the processor 330 may also integrate an application processor, which mainly handles operating systems, user interfaces, applications, etc., and a baseband processor, which mainly handles wireless communications. It will be appreciated that the baseband processor described above may not be integrated into the processor 330.
In the embodiments of the application, the processor 330 is configured to obtain, from the semantic record information, a target semantic understanding result matching the target state information, and to perform semantic understanding on the voice interaction instruction based on a semantic understanding model corresponding to the target semantic understanding result, where the semantic record information contains the correspondence between each user's historical state information and semantic understanding results.
In one possible implementation, the target semantic record information further includes the time period corresponding to each piece of historical state information;
the processor 330 is specifically configured to: determine the target time at which the target state information was acquired; search the target semantic record information for the target time period matching the target time, and determine the historical state information corresponding to that time period; and search the historical state information corresponding to the target time period for entries matching the target state information, taking at least one semantic understanding result corresponding to the matched historical state information as the target semantic understanding result.
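A minimal sketch of this time-period screening follows, reusing the hypothetical SemanticRecord above; the "HH:MM-HH:MM" period format and the helper names are assumptions of the sketch (it also ignores periods that wrap past midnight).

```python
# Sketch: filter a user's records down to those whose time period contains
# the target time, before any state matching is done.
from datetime import datetime

def in_period(target_time: datetime, period: str) -> bool:
    # period is assumed to look like "22:00-23:00" (no midnight wrap)
    start_s, end_s = period.split("-")
    start = datetime.strptime(start_s, "%H:%M").time()
    end = datetime.strptime(end_s, "%H:%M").time()
    return start <= target_time.time() <= end

def records_for_target_time(records, target_time):
    return [r for r in records
            if r.time_period and in_period(target_time, r.time_period)]
```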
In one possible implementation, the target semantic record information further includes the number of records of each semantic understanding result corresponding to each piece of historical state information;
the processor 330 is specifically configured to: search the target semantic record information for historical state information matching the target state information, and obtain at least one semantic understanding result corresponding to the matched historical state information; for each obtained semantic understanding result, determine its target record count, and determine the total record count from all the target record counts; for each obtained semantic understanding result, determine the proportion between its target record count and the total record count; and take each semantic understanding result whose proportion reaches the first threshold as a target semantic understanding result.
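A sketch of this record-count screening, under the same assumed record layout; the 0.5 default for the first threshold is only an example value.

```python
# Sketch: keep the semantic understanding results whose share of the total
# record count reaches the first threshold.
from collections import Counter

def results_by_proportion(matched_records, first_threshold=0.5):
    counts = Counter()
    for r in matched_records:
        counts[r.result] += r.record_count
    total = sum(counts.values())
    return [res for res, n in counts.items()
            if total and n / total >= first_threshold]
```

With the counts of Table 4 later in the document (80, 10, and 3 records, 93 in total), only "play songs" (80/93 ≈ 0.86) would pass a first threshold of 0.5.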
In one possible implementation, the target semantic record information further includes the confidence of each semantic understanding result corresponding to each piece of historical state information;
the processor 330 is specifically configured to: search the target semantic record information for historical state information matching the target state information, and obtain at least one semantic understanding result corresponding to the matched historical state information; determine the corresponding confidence for each obtained semantic understanding result; and take each semantic understanding result whose confidence reaches the second threshold as a target semantic understanding result.
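The confidence-based variant is a one-line filter under the same assumptions; the 0.8 default for the second threshold is illustrative.

```python
# Sketch: keep the results whose stored confidence reaches the second threshold.
def results_by_confidence(matched_records, second_threshold=0.8):
    return [r.result for r in matched_records if r.confidence >= second_threshold]
```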
In one possible implementation, the processor 330 finds the historical state information matching the target state information as follows:
for any piece of historical state information in the target semantic record information, match each piece of dimension information contained in the target state information against the corresponding dimension information contained in the historical state information, and determine the target number of successfully matched dimensions;
take the historical state information whose target number reaches the third threshold as the historical state information matching the target state information.
In one possible implementation, if there are at least two data acquisition devices, the processor 330 is further configured to bind the pieces of the target user's user feature information collected by the at least two data acquisition devices.
In one possible implementation, the binding condition is:
it is determined that the target user in the target area has not completed the binding between the pieces of user feature information collected by the at least two data acquisition devices, and the at least two data acquisition devices are able to collect the target user's user feature information.
In one possible implementation, after performing semantic understanding on the voice interaction instruction, the processor 330 is further configured to:
if the semantic understanding result is a request for a network media resource, send request information to the extranet server to obtain the corresponding network media resource; or
if the semantic understanding result is the issuing of a control instruction, send the control instruction to a controlled device in the same local area network, so that the controlled device operates according to the control instruction.
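A sketch of this dispatch step; the two send_* helpers are placeholders standing in for the communication-interface calls, and the result-type strings are invented for the sketch.

```python
# Sketch: route the understood intention either to the extranet server
# (media request) or to a controlled device in the same LAN (control).
def send_to_extranet_server(payload: dict):
    print("requesting network media resource:", payload)  # stand-in call

def send_to_controlled_device(payload: dict):
    print("sending control instruction:", payload)        # stand-in call

def dispatch(result_type: str, payload: dict):
    if result_type == "request_media":
        send_to_extranet_server(payload)
    elif result_type == "control_instruction":
        send_to_controlled_device(payload)
```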
Semantic understanding device 30 also includes a power source 340 (such as a battery) that powers the various components. The power source 340 may be logically connected to the processor 330 through a power management system, so as to manage charging, discharging, and power consumption functions through the power management system. The semantic understanding device 30 may also be configured with power buttons for power on and power off functions of the semantic understanding device 30.
The semantic understanding device 30 also includes a bus 350 for connecting various components within the semantic understanding device 30, the bus 350 representing one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
It should be noted that the semantic understanding device 30 provided in the embodiments of the application further includes a Bluetooth module, a Wireless Fidelity (Wi-Fi) module, and the like, where:
the Bluetooth module is used to exchange information, via the Bluetooth protocol, with other devices equipped with Bluetooth modules. For example, the semantic understanding device 30 may establish a Bluetooth connection through its Bluetooth module with a wearable electronic device (e.g., a smart watch) that is also equipped with a Bluetooth module, and exchange data with it;
Wi-Fi is a short-range wireless transmission technology; through the Wi-Fi module, the semantic understanding device 30 can help users send and receive e-mails, browse web pages, access streaming media, and the like, providing users with wireless broadband Internet access.
Based on the application scenarios described above, the semantic understanding method provided by the exemplary embodiments of the application is described below with reference to those scenarios. It should be noted that the above application scenarios are presented only to facilitate understanding of the spirit and principles of the application, and the embodiments of the application are not limited in this respect.
Referring to fig. 3, fig. 3 exemplarily provides a semantic understanding method applicable to a home edge computing server in an embodiment of the present application, and the method includes:
step S300, after receiving a voice interaction instruction triggered by a target user, acquiring target state information of the target user acquired by at least one data acquisition device.
To ensure the accuracy of the semantic understanding result, in the embodiments of the application, during semantic understanding, in addition to obtaining the speech recognition content of the voice interaction instruction, the target state information of the target user also needs to be determined, so that the target semantic understanding result, which characterizes the target user's requirement and/or scene, can be determined based on that target state information.
The target state information is user state information corresponding to the target user, and the user state information includes but is not limited to:
user emotion information, user wearing information, user behavior and action information and user position information.
In one possible implementation, the user state information is obtained by the home edge computing server by recognizing, with built-in recognition algorithms, the user data collected by at least one data acquisition device, where the user data includes but is not limited to: audio data, video data, and data collected by the household millimeter wave radar sensor.
For example, the home edge computing server receives, through the communication interface, the video data collected by the video data acquisition device and performs recognition processing on it with built-in recognition algorithms to obtain user state information, where the recognition processing includes, but is not limited to:
identifying user emotion information of each user contained in the video data through a facial expression identification algorithm;
identifying user wearing information of each user contained in the video data through a clothing identification algorithm;
and identifying the user behavior action information of each user contained in the video data through a user behavior action identification algorithm.
It should be noted that, when performing recognition processing on the video data, the face feature information of each user contained in the video data is also recognized through a face recognition algorithm; the face feature information is used to determine the user identity of each user contained in the video data. A human body tracking algorithm also needs to be used to bind each user's identity in the video data with the corresponding user emotion information, user wearing information, and user behavior and action information.
For example, the home edge computing server receives, through the communication interface, the audio data collected by the audio data acquisition device and recognizes it with a built-in recognition algorithm to obtain user state information; the recognition process is: recognizing, through a speech recognition algorithm, the speech recognition content corresponding to each user in the audio data.
It should be noted that, when performing recognition processing on the audio data, the voiceprint feature information of each user contained in the audio data is also recognized through a voiceprint recognition algorithm; the voiceprint feature information is used to determine the user identity of each user contained in the audio data, and each user's voiceprint feature information is bound with the corresponding speech recognition content.
Meanwhile, the home edge computing server of the embodiment of the application also identifies the semantic understanding result of the voice recognition content in the audio data through a local semantic understanding algorithm.
In a special case, when the audio data acquisition device is the bundled smart speaker, the user location information can also be bound according to which preset speaker picked up the sound.
For example, the home edge computing server receives, through the communication interface, the user location information, user contour feature information, user heartbeat feature information, and the like collected by the household millimeter wave radar sensor; the user contour feature information and user heartbeat feature information are used to determine the user identity, and the user identity is bound with the corresponding user location information.
To make the semantic understanding results more accurate, multiple data acquisition devices are usually installed in a home scene, and user data is collected through all of them; each data acquisition device transmits the user data it collects to the home edge computing server, and the server recognizes the acquired user data and determines each user's identity and corresponding user state information.
Another problem to be solved in the embodiments of the application is how to determine that the voiceprint feature information recognized from the audio data, the face feature information recognized from the video data, and the user contour feature information and user heartbeat feature information collected by the sensor, all of which are used to determine user identity, belong to the same user.
For example, suppose the voiceprint feature information recognized from the audio data includes voiceprint feature information A and voiceprint feature information B. If only the audio data acquisition device is present, user identity a can be directly assigned to voiceprint feature information A, and user identity b to voiceprint feature information B. However, if face feature information is also recognized from the video data, and it includes face feature information A and face feature information B, it cannot be determined whether face feature information A corresponds to voiceprint feature information A or voiceprint feature information B, that is, whether face feature information A corresponds to user identity a or user identity b. In other words, how to determine, from the user data collected by multiple data acquisition devices, whether the pieces of user feature information belong to the same user is another problem to be solved in the embodiments of the application.
Considering the professional skill required by user-binding operations and the operation and maintenance problems that may arise after data acquisition devices enter thousands of households, the embodiments of the application provide a user feature information binding method for determining identity across multiple data acquisition devices.
referring to fig. 4, fig. 4 exemplarily provides a flowchart of a method for binding user feature information in an embodiment of the present application, including the following steps:
step S400, when the home edge computing server identifies that user data collected by target data collection equipment contains user characteristic information through a built-in identification algorithm in the working process, unique user identity identification is distributed to a user corresponding to the identified user characteristic information;
for example, the target data acquisition device is a video data acquisition device, the home edge computing server calls a face recognition algorithm, recognizes that the video data contains face feature information, and allocates a unique user identity to a user in a home corresponding to each piece of face feature information; or
The target data acquisition equipment is audio data acquisition equipment, the home edge computing server calls a voiceprint recognition algorithm, recognizes that the audio data contains voiceprint characteristic information, and allocates a unique user identity to a user in a home corresponding to each voiceprint characteristic information; or
The target data acquisition equipment is a millimeter wave radar sensor, and the home edge computing server allocates unique user identity identifiers to the users in the home corresponding to the user profile characteristic information and/or the user heartbeat characteristic information based on the user profile characteristic information and/or the user heartbeat characteristic information transmitted by the millimeter wave radar sensor.
It should be noted that, when assigning user identities to users in the household, a user identity is assigned only to a newly added user, and the binding between pieces of user feature information is performed for the newly added user. Therefore, when the target user is a newly added user, a user identity is assigned to the target user, and the pieces of the target user's user feature information collected by the data acquisition devices are bound.
Step S401, the home edge computing server determines that the binding condition is satisfied;
in one possible implementation, the home edge computing server determines, based on the user data collected by the target data acquisition device, that the household contains only one user, that this user has not completed the binding between the pieces of user feature information collected by the multiple data acquisition devices, and that the multiple data acquisition devices are able to collect this user's feature information; it then determines that the binding condition is satisfied and obtains the user identity of this user.
In another possible implementation, the home edge computing server determines, based on the user data collected by the target data acquisition device, that the household contains multiple users, but only one of them has not completed the binding between the pieces of user feature information collected by the multiple data acquisition devices, and that the multiple data acquisition devices are able to collect that user's feature information; it then determines that the binding condition is satisfied and obtains the user identity of the user who has not completed the binding.
In another possible implementation, the home edge computing server determines, based on the user data collected by the target data acquisition device, that a target location in the home (such as a bedroom or living room) contains only one user, that this user has not completed the binding between the pieces of user feature information collected by the multiple data acquisition devices, and that the multiple data acquisition devices are able to collect this user's feature information; it then determines that the binding condition is satisfied and obtains the user identity of this user.
Step S402, the home edge computing server waits for other data acquisition devices to upload the acquired user data.
Step S403, the home edge computing server recognizes the received user data collected by the other data acquisition devices, obtains user feature information, and binds the obtained user feature information with the user feature information collected by the target data acquisition device and the corresponding user identity.
Step S404, the home edge computing server determines whether the binding between the user feature information collected by the multiple data acquisition devices has been completed; if so, step S405 is executed; otherwise, step S402 is executed.
In one possible implementation, when the home edge computing server recognizes that the user data received from the other data acquisition devices contains multiple pieces of user feature information, it determines whether those pieces have been assigned user identities. If they have all been assigned, it determines that the binding between the user feature information collected by the multiple data acquisition devices is not yet complete; if exactly one piece of user feature information is unbound, it binds that piece with the user feature information collected by the target data acquisition device and the corresponding user identity; if two or more pieces are unbound, it determines that the binding between the user feature information collected by the multiple data acquisition devices is not complete.
In another possible implementation, when the home edge computing server recognizes that the user data received from the other data acquisition devices contains one piece of user feature information, and that piece has not been assigned a user identity, it binds that piece with the user feature information collected by the target data acquisition device and the corresponding user identity; if a binding relationship already exists, the binding is refined.
Step S405, the home edge computing server stores the binding relationship between the user identity and the user feature information collected by the multiple data acquisition devices.
Table 1 shows the binding relationship between a user identity and the user feature information collected by the multiple data acquisition devices.
TABLE 1
User identity | Face feature information | Voiceprint feature information | Contour feature information | Heartbeat feature information
In the embodiments of the application, after the home edge computing server has bound the user identity with the user feature information collected by the multiple data acquisition devices, once a user's identity is determined through any one piece of user feature information (face, voiceprint, contour, or heartbeat feature information), the home edge computing server can synchronously associate the recognition results for the same user at nearby moments from the other data acquisition devices, achieving cross-media perception.
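As an illustration of what the stored binding of Table 1 could look like in memory, a small sketch follows; the dictionary layout, key names, and lookup helper are assumptions of the sketch.

```python
# Sketch of the Table 1 binding: one user identity keyed to the feature
# information collected by each data acquisition device, so a user
# recognized via any single feature can be cross-referenced.
bindings = {
    "user_a": {
        "face": "face_feature_a",               # video data acquisition device
        "voiceprint": "voiceprint_feature_a",   # audio data acquisition device
        "contour": "contour_feature_a",         # millimeter wave radar sensor
        "heartbeat": "heartbeat_feature_a",     # millimeter wave radar sensor
    },
}

def user_for_feature(kind: str, value: str):
    """Resolve a user identity from any one bound feature."""
    for user_id, features in bindings.items():
        if features.get(kind) == value:
            return user_id
    return None
```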
When the target state information includes the four dimensions of target user emotion information, target user wearing information, target user behavior and action information, and target user location information, these pieces of information may not be acquired in real time or at the same moment. In the embodiments of the application, a time window is therefore set when acquiring the target state information, and the most recent recognition result for the target user within that window is used for each dimension: the most recently recognized target user emotion information, wearing information, behavior and action information, and location information within the window.
It should be noted that the time window for obtaining the target user's state information is a compensation strategy: it reuses recognition results as much as possible during semantic understanding, while preventing stale results from being used; for example, a gesture made before the user went to sleep should not serve as the basis for a voice interaction triggered after the user gets up during the night. For example, when speech recognition on the bundled smart speaker is triggered, only the user identity and the user's location information may be available at that moment; the other dimensions are then filled with the most recent recognition result of each dimension within the time window.
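A sketch of this time-window compensation follows; the 30-minute window, the observation tuple format, and the assumption that observations arrive oldest-first are all choices made for the sketch.

```python
# Sketch: for each dimension, keep only the most recent recognition result
# that falls inside the time window; stale results are dropped.
from datetime import datetime, timedelta

def fill_state(observations, now, window=timedelta(minutes=30)):
    """observations: list of (timestamp, dimension, value), oldest first."""
    state = {}
    for ts, dim, value in observations:
        if now - ts <= window:
            state[dim] = value  # a later observation overwrites an earlier one
    return state
```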
Step S301, in the target semantic record information corresponding to the target user, obtaining a target semantic understanding result matched with the target state information.
To save computing resources and semantic understanding time, the embodiments of the application no longer invoke the semantic understanding model required for every user requirement and/or every scene. Instead, a target semantic understanding result characterizing the target user's requirement and/or scene is predicted based on the target user's target state information, the semantic understanding model corresponding to that result is determined, and the semantic understanding service is then performed with the determined model. In this computation process, the semantic understanding models for all requirements and/or scenes no longer all need to be invoked, which saves computing resources and semantic understanding time.
In the embodiments of the application, when obtaining the target semantic understanding result matching the target state information, matching is mainly performed between the target state information and the target user's historical state information recorded in the target semantic record information: the successfully matched historical state information is obtained, and the target semantic understanding result corresponding to it, which characterizes the target user's requirement and/or scene, is determined. The target semantic record information therefore mainly includes the correspondence between the target user's historical state information and semantic understanding results, as shown in Table 2:
TABLE 2
User identity | User emotion | User wearing | User behavior action | User location | Semantic understanding result
User A | Unhappy | Overcoat | Lying down | Bedroom | Play songs
User A | Unhappy | Overcoat | Lying down | Living room | Tomorrow's weather
User A | Unhappy | Overcoat | Lying down | Bedroom | Tell a joke
… | … | … | … | … | …
In the embodiments of the application, after the home edge computing server obtains the user data collected by at least one data acquisition device, it recognizes the user data with built-in recognition algorithms and determines the user feature information collected by each data acquisition device and the corresponding user state information;
for example, the face feature information and the corresponding user emotion, user wearing, and user behavior are recognized from the video data; the voiceprint feature information and the speech recognition content corresponding to the voice interaction instruction are recognized from the audio data; and the contour feature information and the corresponding user location are recognized from the data transmitted by the household millimeter wave radar sensor.
Because the pieces of user feature information are bound to user identities, the user identity can be determined, and the target state information corresponding to that user identity can be determined; the target state information includes dimensions such as user emotion, user wearing, user behavior, and user location.
After the user identity and the corresponding target state information are determined, the target state information is matched against each piece of historical state information corresponding to that user identity, and the semantic understanding result corresponding to the successfully matched historical state information is obtained as the target semantic understanding result.
In one possible implementation, when matching the target state information against the historical state information, for any piece of historical state information, each piece of dimension information in the target state information is matched against the corresponding dimension information in the historical state information, and the target number of successfully matched dimensions is determined; the historical state information whose target number reaches the third threshold is then taken as the historical state information successfully matched with the target state information;
for example, the user emotion, user wearing, user behavior action, and user location in the target state information are each matched against the corresponding fields of the historical state information, and the target number of successfully matched dimensions is determined; if the target number reaches the third threshold, the historical state information is determined to match the target state information, and the semantic understanding result corresponding to it is obtained as the target semantic understanding result.
In the embodiments of the application, to make the determined target semantic understanding result and its corresponding semantic understanding model more accurate, and considering that the target user's requirements and scenes differ across time periods (that is, the target semantic understanding results differ across time periods), target semantic record information that also contains the time period corresponding to each piece of historical state information is provided, as shown in Table 3:
TABLE 3
(Table 3 is rendered as an image in the original publication; it extends Table 2 with a time-period column for each piece of historical state information.)
When the target semantic record information contains the time periods corresponding to the historical state information, the target semantic understanding result is determined as follows: determine the target time at which the target state information was acquired; search the target semantic record information for the target time period matching the target time, and determine the historical state information corresponding to that time period; search that historical state information for entries matching the target state information; and take at least one semantic understanding result corresponding to the matched historical state information as the target semantic understanding result.
In one possible implementation, when matching the target state information against the historical state information, for any piece of historical state information, each piece of dimension information in the target state information is matched against the corresponding dimension information in the historical state information, and the target number of successfully matched dimensions is determined; the historical state information whose target number reaches the third threshold is then taken as the historical state information successfully matched with the target state information.
In this way, the historical state information is first screened once based on the target time at which the target state information was acquired, and then screened a second time by matching the target state information against the historical state information; the semantic understanding result corresponding to the historical state information obtained by the second screening is taken as the target semantic understanding result. This improves the accuracy of the target semantic understanding result and of the corresponding semantic understanding model, and thus the accuracy of semantic understanding.
In this embodiment of the present application, in order to screen out a target semantic understanding result that better fits the target user, the target semantic record information may further include the number of records of each semantic understanding result corresponding to each piece of historical state information, as shown in table 4:
TABLE 4
User identification | User emotion | User wearing | User behavior action | User location | Semantic understanding result | Number of records
User A | Not happy | Overcoat | Lying down | Bedroom | Playing songs | 80
User A | Not happy | Overcoat | Lying down | Bedroom | Tomorrow's weather | 10
User A | Not happy | Overcoat | Lying down | Bedroom | Telling jokes | 3
…… | …… | …… | …… | …… | …… | ……
When the target semantic record information contains the number of records of each semantic understanding result, the target semantic understanding result is determined as follows: the historical state information matching the target state information is searched for in the target semantic record information, and at least one semantic understanding result corresponding to the successfully matched historical state information is acquired; for each acquired semantic understanding result, the corresponding target record number is determined, and the total record number is determined from all the acquired target record numbers; the proportion value between each target record number and the total record number is then determined; finally, the semantic understanding result whose proportion value reaches a first threshold is taken as the target semantic understanding result.
As shown in table 4, suppose the target state information is: user emotion - not happy, user wearing - overcoat, user behavior action - lying down, user location - bedroom. The semantic understanding results corresponding to the historical state information that successfully matches the target state information are: playing songs, tomorrow's weather, and telling jokes, whose record numbers are 80, 10, and 3 respectively. The total record number is 93, so the corresponding proportion values are approximately 0.86, 0.11, and 0.03. If the first threshold is 0.5, playing songs is taken as the target semantic understanding result.
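A sketch of this proportion-based selection, with the first threshold defaulting to the 0.5 used in the example:

```python
def select_by_proportion(record_numbers: dict, first_threshold: float = 0.5) -> list:
    """record_numbers maps each semantic understanding result to its record number."""
    total = sum(record_numbers.values())
    return [res for res, n in record_numbers.items()
            if total and n / total >= first_threshold]

# Reproducing the table 4 example (80 / 93 is roughly 0.86; the rest fall short):
# select_by_proportion({"playing songs": 80, "tomorrow's weather": 10,
#                       "telling jokes": 3})  ->  ["playing songs"]
```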
In a possible implementation manner, when the target semantic record information contains the number of records of each semantic understanding result, the target semantic understanding result may also be determined as follows: the historical state information matching the target state information is searched for in the target semantic record information, and at least one semantic understanding result corresponding to the successfully matched historical state information is acquired; for each acquired semantic understanding result, the corresponding target record number is determined; the semantic understanding result whose target record number is greater than a record number threshold is then taken as the target semantic understanding result.
In a possible implementation manner, when matching the target state information with the historical state information: for any piece of historical state information, each dimension of information in the target state information is matched with the corresponding dimension of information in that historical state information, and the target number of successfully matched dimensions is determined; the historical state information whose target number reaches the third threshold is then acquired as the historical state information that successfully matches the target state information.
In this embodiment of the present application, in order to screen out a target semantic understanding result that better fits the target user, the target semantic record information may further include the confidence of each semantic understanding result corresponding to each piece of historical state information, as shown in table 5:
TABLE 5
User identification | User emotion | User wearing | User behavior action | User location | Semantic understanding result | Confidence
User A | Not happy | Overcoat | Lying down | Bedroom | Playing songs | 80%
User A | Not happy | Overcoat | Lying down | Bedroom | Tomorrow's weather | 10%
User A | Not happy | Overcoat | Lying down | Bedroom | Telling jokes | 3%
…… | …… | …… | …… | …… | …… | ……
When the target semantic record information contains the confidence of each semantic understanding result, the target semantic understanding result is determined as follows: the historical state information matching the target state information is searched for in the target semantic record information, and at least one semantic understanding result corresponding to the successfully matched historical state information is acquired; the corresponding confidence is determined for each acquired semantic understanding result; the semantic understanding result whose confidence reaches a second threshold is then taken as the target semantic understanding result.
As shown in table 5, suppose the target state information is: user emotion - not happy, user wearing - overcoat, user behavior action - lying down, user location - bedroom. The semantic understanding results corresponding to the historical state information that successfully matches the target state information are: playing songs, tomorrow's weather, and telling jokes, whose confidences are 80%, 10%, and 3% respectively. If the second threshold is 50%, playing songs is taken as the target semantic understanding result.
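The confidence-based variant follows the same pattern; a sketch with the second threshold defaulting to the 50% of the example:

```python
def select_by_confidence(confidences: dict, second_threshold: float = 0.5) -> list:
    """confidences maps each semantic understanding result to its confidence."""
    return [res for res, c in confidences.items() if c >= second_threshold]

# select_by_confidence({"playing songs": 0.80, "tomorrow's weather": 0.10,
#                       "telling jokes": 0.03})  ->  ["playing songs"]
```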
In the embodiment of the present application, in order to make the screened target semantic understanding result more accurate, the target semantic record information may integrate the contents of tables 2 to 5, as shown in table 6:
TABLE 6
(Table 6 appears as an image in the original publication; it combines the columns of tables 2 to 5: user identification, historical state information, time period, semantic understanding result, number of records, and confidence.)
When the target semantic record information contains the historical state information, time period, number of records, confidence and semantic understanding result, the target semantic understanding result is determined as follows:
matching is performed between the time at which the target state information was acquired and the time periods, and the historical state information corresponding to the successfully matched time period is determined;
for each piece of the determined historical state information, it is matched with the target state information, and the semantic understanding results corresponding to the successfully matched historical state information are determined;
for each determined semantic understanding result, the corresponding number of records and confidence are determined and combined by weighting to obtain a weighted result value; the semantic understanding result whose weighted result value is greater than a set threshold is selected as the target semantic understanding result.
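This application does not fix a concrete weighting scheme; the sketch below assumes a linear combination of the normalized record proportion and the confidence with equal weights, which is only one possible reading.

```python
def weighted_select(candidates: list, w_count: float = 0.5, w_conf: float = 0.5,
                    set_threshold: float = 0.5) -> list:
    """candidates is a list of (result, record_number, confidence) triples."""
    total = sum(n for _, n, _ in candidates) or 1
    selected = []
    for result, n, confidence in candidates:
        # Weighted result value: normalized record share combined with confidence.
        score = w_count * (n / total) + w_conf * confidence
        if score > set_threshold:
            selected.append(result)
    return selected
```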
In a possible implementation manner, when data statistical analysis determines that a certain semantic understanding result is unrelated to a certain dimension of information, the user state corresponding to that dimension may be set to null; for example, if the semantic understanding result is unrelated to the user behavior action, the user behavior action may be set to null, as shown in table 7:
TABLE 7
(Table 7 appears as an image in the original publication; it shows records in which a dimension unrelated to the semantic understanding result, such as the user behavior action, is set to null.)
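In the matching sketch above, a null dimension only needs to be treated as a wildcard; the variant below makes that explicit, under the same illustrative assumptions.

```python
def matches_ignoring_null(target_state: dict, historical_state: dict,
                          threshold: int = THIRD_THRESHOLD) -> bool:
    hits = 0
    for dim in DIMENSIONS:
        stored = historical_state.get(dim)
        # A null dimension means this result is unrelated to the dimension,
        # so it is counted here as matching any target value.
        if stored is None or stored == target_state.get(dim):
            hits += 1
    return hits >= threshold
```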
It should be noted that the home edge computing server stores the semantic record information corresponding to each home user. Each stored record comprises the speech recognition content recognized by the home edge computing server from the acquired voice interaction instruction, the user state information recognized from the user data collected by the data acquisition devices, the user identity, and the semantic understanding result determined for the speech recognition content, and the record is updated together with the time period during which the user data was acquired. The specific relationship is shown in table 8:
TABLE 8
(Table 8 appears as an image in the original publication; it relates the user identity, the recognized speech content, the user state information, the acquisition time period, and the semantic understanding result.)
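One possible in-memory layout for such a record is sketched below; all field names are assumptions, since table 8 itself is only available as an image in the original publication.

```python
from dataclasses import dataclass

@dataclass
class SemanticRecord:
    user_id: str        # user identity recognized from the collected data
    speech_text: str    # speech recognition content of the voice instruction
    state: dict         # user state information (emotion, wearing, action, location)
    time_period: str    # time period in which the user data was acquired
    result: str         # semantic understanding result for the speech content
    record_number: int = 1
    confidence: float = 0.0
```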
Step S302: semantic understanding is performed on the voice interaction instruction triggered by the target user, based on the semantic understanding model corresponding to the target semantic understanding result.
Performing semantic understanding on the voice interaction instruction based on the semantic understanding model means that speech recognition is first performed on the voice interaction instruction, and semantic understanding is then performed on the speech content information obtained by the speech recognition.
The semantic understanding result obtained in this way is the user intention, and a corresponding service is provided to the user based on the semantic understanding result.
If the semantic understanding result is a request for a network media resource, request information is sent to an external network server to acquire the corresponding network media resource; or
if the semantic understanding result is the issuing of a control instruction, the control instruction is sent to a controlled device in the same local area network, so that the controlled device operates according to the control instruction.
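A sketch of this dispatch step; the result labels and the client interfaces are hypothetical, since this application does not name concrete APIs.

```python
def dispatch(result_type: str, payload, network_client, lan_client):
    if result_type == "request_media":
        # Send request information to an external network server for the resource.
        return network_client.request(payload)
    if result_type == "control_instruction":
        # Send the control instruction to the controlled device on the same LAN.
        return lan_client.send(payload)
    raise ValueError(f"unsupported semantic understanding result: {result_type}")
```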
In the embodiment of the application, the target state information is determined through the sensing capability of the semantic understanding device, and the semantic understanding model is screened based on the target state information and the historical state information. The semantic understanding models for every scene and/or requirement no longer need to be called; only the semantic understanding model matching the target state information is called. This reduces the number of semantic understanding model calls, saves computing resources and semantic understanding time, and at the same time improves the accuracy of semantic understanding.
Based on the same inventive concept, an embodiment of the present application further provides an apparatus 500 for semantic understanding. Referring to fig. 5, which exemplarily shows the apparatus 500, the apparatus includes:
the first obtaining module 501 is configured to obtain target state information of a target user, which is collected by at least one data collection device, after receiving a voice interaction instruction triggered by the target user;
a second obtaining module 502, configured to obtain, in target semantic record information corresponding to a target user, a target semantic understanding result matched with target state information, where the target semantic record information includes a correspondence between each piece of historical state information of the target user and the semantic understanding result;
and the semantic understanding module 503 is configured to perform semantic understanding on the voice interaction instruction based on the semantic understanding model corresponding to the target semantic understanding result.
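A minimal sketch tying the three modules together, reusing screen_records() from the earlier sketch; the constructor arguments and the fallback model used when no record matches are assumptions.

```python
class SemanticUnderstandingApparatus:
    """Sketch of apparatus 500: obtain state, match records, run one model."""

    def __init__(self, records_by_user: dict, models_by_result: dict, default_model):
        self.records_by_user = records_by_user    # target semantic record information
        self.models_by_result = models_by_result  # semantic understanding models
        self.default_model = default_model        # used when nothing matches

    def handle(self, user_id: str, target_state: dict, target_time, speech_text: str):
        # Modules 501/502: obtain the state and match it against the records.
        results = screen_records(self.records_by_user.get(user_id, []),
                                 target_time, target_state)
        # Module 503: call only the model selected by the matched result.
        if results:
            model = self.models_by_result.get(results[0], self.default_model)
        else:
            model = self.default_model
        return model(speech_text)
```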
In a possible implementation manner, the target semantic recording information further includes: time periods corresponding to the historical state information respectively;
the second obtaining module 502 is specifically configured to:
determining target time for acquiring target state information;
searching a target time period matched with the target time in the target semantic record information, and determining historical state information corresponding to the target time period;
and searching the historical state information matched with the target state information in the historical state information corresponding to the target time period, and taking at least one semantic understanding result corresponding to the matched historical state information as a target semantic understanding result.
In a possible implementation manner, the target semantic record information further includes: the number of records of semantic understanding results corresponding to each piece of historical state information;
the second obtaining module 502 is specifically configured to:
searching historical state information matched with the target state information in the target semantic record information, and acquiring at least one semantic understanding result corresponding to the matched historical state information;
determining the corresponding target record number aiming at any one acquired semantic understanding result, and determining the total record number based on each acquired target record number;
determining, for any acquired semantic understanding result, the proportion value between the corresponding target record number and the total record number;
and taking the semantic understanding result with the proportion value reaching the first threshold value as a target semantic understanding result.
In a possible implementation manner, the target semantic recording information further includes: the confidence of the semantic understanding result corresponding to each historical state information;
the second obtaining module 502 is specifically configured to:
searching historical state information matched with the target state information in the target semantic record information, and acquiring at least one semantic understanding result corresponding to the matched historical state information;
respectively determining corresponding confidence degrees aiming at any one obtained semantic understanding result;
and taking the semantic understanding result with the confidence coefficient reaching a second threshold value as a target semantic understanding result.
In one possible implementation manner, the second obtaining module 502 searches for the historical status information matching the target status information by:
aiming at any one piece of historical state information in the target semantic record information, matching each piece of dimension information contained in the target state information with corresponding dimension information contained in the historical state information respectively, and determining the number of targets successfully matched with the dimension information;
and acquiring the historical state information of which the target number reaches a third threshold value as the historical state information matched with the target state information.
In a possible implementation manner, if there are at least two data acquisition devices, the apparatus 500 further includes a binding module 504:
the binding module 504 is specifically configured to: and binding the user characteristic information of the target user, which is acquired by the at least two data acquisition devices respectively.
In one possible implementation, the binding condition is: determining that the target user in the target area has not completed binding between the user characteristic information respectively acquired by the at least two data acquisition devices, and allowing the at least two data acquisition devices to acquire the user characteristic information of the target user.
In one possible implementation manner, after the semantic understanding module 503 semantically understands the voice interaction instruction, the semantic understanding module is further configured to:
if the semantic understanding result is that the network media resource is requested, sending request information to an external network server to acquire the corresponding network media resource; or
if the semantic understanding result is the issuing of a control instruction, sending the control instruction to a controlled device in the same local area network, so that the controlled device operates according to the control instruction.
In some possible embodiments, the aspects of the method for semantic understanding provided herein may also be implemented in the form of a program product that includes program code. When the program product runs on a computer device, the program code causes the computer device to perform the steps of the method for semantic understanding according to the various exemplary embodiments of the present application described above in this specification.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for semantic understanding according to an embodiment of the present application may employ a portable compact disc read-only memory (CD-ROM), include program code, and be executed on a computing device.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages.
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more of the units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided into multiple units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A semantic understanding apparatus, characterized in that the apparatus comprises: a communication interface and a processor, wherein:
the communication interface is used for acquiring target state information of a target user acquired by at least one data acquisition device after receiving a voice interaction instruction triggered by the target user;
the processor is configured to obtain a target semantic understanding result matched with the target state information from target semantic record information corresponding to the target user, and perform semantic understanding on the voice interaction instruction based on a semantic understanding model corresponding to the target semantic understanding result, where the target semantic record information includes a correspondence between each piece of historical state information of the target user and the semantic understanding result.
2. The apparatus of claim 1, wherein the target semantic record information further comprises: time periods corresponding to the historical state information respectively;
the processor is specifically configured to:
determining target time for acquiring the target state information;
searching the target time period matched with the target time in the target semantic record information, and determining historical state information corresponding to the target time period;
and searching historical state information matched with the target state information in the historical state information corresponding to the target time period, and taking at least one semantic understanding result corresponding to the matched historical state information as the target semantic understanding result.
3. The apparatus of claim 1, wherein the target semantic record information further comprises: the number of records of semantic understanding results corresponding to each historical state information;
the processor is specifically configured to:
searching historical state information matched with the target state information in the target semantic record information, and acquiring at least one semantic understanding result corresponding to the matched historical state information;
determining the corresponding target record number aiming at any one acquired semantic understanding result, and determining the total record number based on each acquired target record number;
determining, for any one acquired semantic understanding result, the proportion value between the corresponding target record number and the total record number;
and taking the semantic understanding result with the proportion value reaching the first threshold value as the target semantic understanding result.
4. The apparatus of claim 1, wherein the target semantic record information further comprises: the confidence of the semantic understanding result corresponding to each piece of historical state information;
the processor is specifically configured to:
searching historical state information matched with the target state information in the target semantic record information, and acquiring at least one semantic understanding result corresponding to the matched historical state information;
respectively determining corresponding confidence degrees aiming at any one obtained semantic understanding result;
and taking the semantic understanding result with the confidence degree reaching a second threshold value as the target semantic understanding result.
5. The device of any of claims 2 to 4, wherein the processor is configured to find historical state information that matches the target state information by:
aiming at any one piece of historical state information in the target semantic record information, matching each piece of dimension information contained in the target state information with corresponding dimension information contained in the historical state information respectively, and determining the target number of successfully matched dimension information;
and acquiring the historical state information of which the target number reaches a third threshold value as the historical state information matched with the target state information.
6. The device of claim 1, wherein if there are at least two data acquisition devices, the processor is further configured to:
and binding the user characteristic information of the target user, which is acquired by the at least two data acquisition devices respectively.
7. The apparatus of claim 6, wherein the binding condition is:
determining that the target user in the target area has not completed binding between the user characteristic information respectively acquired by the at least two data acquisition devices, and allowing the at least two data acquisition devices to acquire the user characteristic information of the target user.
8. The device of claim 1, wherein after the processor semantically understands the voice interaction instructions, further configured to:
if the semantic understanding result is that the network media resource is requested, sending request information to an external network server to acquire the corresponding network media resource; or
if the semantic understanding result is that a control instruction is issued, sending the control instruction to a controlled device in the same local area network so that the controlled device works according to the control instruction.
9. A method of semantic understanding, the method comprising:
after receiving a voice interaction instruction triggered by a target user, acquiring target state information of the target user acquired by at least one data acquisition device;
acquiring a target semantic understanding result matched with the target state information from target semantic record information corresponding to the target user, wherein the target semantic record information comprises corresponding relations between each history state information of the target user and the semantic understanding result;
and performing semantic understanding on the voice interaction instruction based on a semantic understanding model corresponding to the target semantic understanding result.
10. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed by a processor, implement the method of semantic understanding of claim 9.
CN202110787623.8A 2021-07-13 2021-07-13 Semantic understanding equipment and method and storage medium Pending CN115691483A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110787623.8A CN115691483A (en) 2021-07-13 2021-07-13 Semantic understanding equipment and method and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110787623.8A CN115691483A (en) 2021-07-13 2021-07-13 Semantic understanding equipment and method and storage medium

Publications (1)

Publication Number Publication Date
CN115691483A true CN115691483A (en) 2023-02-03

Family

ID=85044452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110787623.8A Pending CN115691483A (en) 2021-07-13 2021-07-13 Semantic understanding equipment and method and storage medium

Country Status (1)

Country Link
CN (1) CN115691483A (en)

Similar Documents

Publication Publication Date Title
US20200152177A1 (en) Speech recognition method and apparatus, and storage medium
JP2020144375A (en) System control method, system, and program
US20150379993A1 (en) Method of providing voice command and electronic device supporting the same
CN110235087A (en) A kind of method and terminal for realizing voice control
US9911417B2 (en) Internet of things system with voice-controlled functions and method for processing information of the same
US20200193982A1 (en) Terminal device and method for controlling thereof
CN110010125A (en) A kind of control method of intelligent robot, device, terminal device and medium
JP2022037100A (en) Voice processing method, device, equipment, and storage medium for on-vehicle equipment
CN112702633A (en) Multimedia intelligent playing method and device, playing equipment and storage medium
CN110488626A (en) A kind of apparatus control method, control device, chromacoder and storage medium
WO2023035676A1 (en) Method and apparatus for controlling household appliances, and storage medium
CN112634872A (en) Voice equipment awakening method and device
CN113506568B (en) Central control and intelligent equipment control method
CN112151013A (en) Intelligent equipment interaction method
CN107729765A (en) Management method, device, storage medium and the electronic equipment of multi-medium data
CN112908321A (en) Device control method, device, storage medium, and electronic apparatus
CN112233676A (en) Intelligent device awakening method and device, electronic device and storage medium
CN109597996B (en) Semantic analysis method, device, equipment and medium
WO2021204187A1 (en) Layout analysis method and electronic device
CN115966203A (en) Audio acquisition method and device
CN115691483A (en) Semantic understanding equipment and method and storage medium
CN111240634A (en) Sound box working mode adjusting method and device
WO2022188551A1 (en) Information processing method and apparatus, master control device, and controlled device
CN112489644B (en) Voice recognition method and device for electronic equipment
CN112948763B (en) Piece quantity prediction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination