CN111862980A - Incremental semantic processing method - Google Patents

Incremental semantic processing method

Info

Publication number
CN111862980A
CN111862980A (application CN202010787458.1A)
Authority
CN
China
Prior art keywords
semantic feature
semantic
voice
module
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010787458.1A
Other languages
Chinese (zh)
Inventor
Cai Yong (蔡勇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zebra Network Technology Co Ltd
Original Assignee
Zebra Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zebra Network Technology Co Ltd filed Critical Zebra Network Technology Co Ltd
Priority to CN202010787458.1A
Publication of CN111862980A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Abstract

The invention provides an incremental semantic processing method applied to an electronic device, where the electronic device includes a voice detection module, a voice recognition module and a semantic processing module. The voice recognition module segments the received voice signal in chronological order and converts each segment into text information in real time. The voice recognition module then sends the text information of each voice signal segment to the semantic processing module in chronological order, and the semantic processing module receives the text information of each segment in real time; whenever the semantic processing module receives the text information of a voice signal segment, it extracts the semantic feature values of that text information, so that the voice signal is processed in parallel with the voice recognition module. The embodiments of the invention can reduce the delay in man-machine language interaction and improve the user experience.

Description

Incremental semantic processing method
Technical Field
The present application relates to the technical field of natural language processing, and in particular to an incremental semantic processing method.
Background
At present, man-machine voice interaction is divided into three parts: voice recognition, semantic processing, and the application program. First, the voice recognition part judges that the user has finished speaking based on voice breakpoint detection; all the recognized text is then sent to the semantic processing engine, and the application program module finally executes the subsequent action according to the semantic processing result. These three links execute entirely in series, each link adds a certain delay, and the result is a poor user experience. Measured on several mainstream automobile products, as shown in fig. 4, the voice detection link takes about 500 ms, the semantic processing link about 200 ms, and the application program module processing link about 500–1000 ms; the three links together add a delay of about 1–3 s.
Disclosure of Invention
In view of this, the present application provides an incremental semantic processing method, which can reduce the delay in human-computer language interaction, quickly understand the content of the user's voice signal and give feedback, and improve the user's voice interaction experience.
To solve this technical problem, the present application adopts the following technical solutions:
in a first aspect, the present application provides an incremental semantic processing method applied to an electronic device, where the electronic device includes a voice detection module, a voice recognition module, and a semantic processing module, and the method includes:
when the voice detection module detects a voice signal, informing the voice recognition module to process the voice signal;
the voice recognition module converts the received voice signals into segmented text information in real time according to a time sequence;
the voice recognition module sequentially sends the text information of each voice signal segment to the semantic processing module according to the time sequence, the semantic processing module receives the text information of each voice signal segment in real time, and when the semantic processing module receives the text information of each voice signal segment, the semantic feature value of the text information of the voice signal segment is extracted, so that the voice signal is processed in parallel with the voice recognition module.
As an embodiment of the first aspect of the present application, the method further comprises:
the semantic processing module sequentially accumulates the semantic feature values corresponding to each section of text information according to a time sequence, scores the semantic feature values obtained for the first time or scores the semantic feature values accumulated based on the semantic feature values obtained for the first time, and sends the semantic feature values or the accumulated semantic feature values to the application program module when the scores of the semantic feature values or the accumulated semantic feature values are larger than a threshold value;
and the application program module receives the semantic feature value corresponding to each section of voice signal in real time and adjusts the output content according to the received semantic feature value each time.
As an embodiment of the first aspect of the present application, the receiving, by the application program module, a semantic feature value corresponding to each segment of speech signal in real time, and adjusting output content according to the received semantic feature value each time includes:
and after receiving the semantic feature values, the application program module obtains a command corresponding to the semantic feature values according to semantic feature value matching, and switches to an application interface for executing the command so as to output interface contents corresponding to the semantic feature values through the application interface.
As an embodiment of the first aspect of the present application, when the voice detection module detects that the voice signal has ended, the application program module executes the command corresponding to the semantic feature values of the complete voice signal, where the complete voice signal is the voice signal within the period from when the voice detection module first detects the voice signal until the voice signal ends.
As an embodiment of the first aspect of the present application, the semantic feature value includes a unique primary key value, and the primary key value is used to identify the first intent of the voice signal.
As an embodiment of the first aspect of the present application, the semantic feature value may further include N secondary key values (N ≧ 0), which are used to further define the range of the first intent of the primary key value.
As an embodiment of the first aspect of the present application, the primary key value and the secondary key values of the semantic feature value are updated in real time according to the voice signal.
As an embodiment of the first aspect of the present application, the finest-granularity unit of the text information is a single Chinese character or a single word.
As an embodiment of the first aspect of the present application, the storage type of the semantic feature value is a json data type.
As an embodiment of the first aspect of the present application, the semantic processing module employs any one of an LSTM, CNN or Transformer model.
The technical scheme of the application has at least one of the following beneficial effects:
according to the incremental semantic processing method, the processing time of the voice signals can be shortened, the voice signal content of the user can be understood quickly, feedback is given, delay in the voice interaction process is reduced, and the voice interaction experience of the user is improved.
Drawings
FIG. 1 is a block diagram of an incremental semantic processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of an incremental semantic processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the runtime of each module according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the runtime of various modules of the prior art;
fig. 5 is a system flowchart of an incremental semantic processing method according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The following description will discuss embodiments of the present application with reference to specific examples.
The embodiments of the present application apply to human-computer voice interaction scenarios of all kinds, such as automobiles and the home. As shown in fig. 1, the electronic device includes a voice detection module 101, a voice recognition module 102, a semantic processing module 103 and an application program module 200, and these modules run in parallel, which effectively reduces the processing time of the human-computer interaction process. For example, a user who wants to listen to a song says to the electronic device: "I want to listen to Liu Dehua's Forgetting Water". When the voice detection module 101 detects the user's voice signal, the voice recognition module 102 processes the voice signal in real time; the real-time result may be segmented text information such as "I want to listen to", "Liu Dehua", "Forgetting Water". The voice recognition module 102 then sends each piece of text information to the semantic processing module 103 in chronological order, and the semantic processing module 103 receives the quoted contents above in turn and extracts the semantic feature values from each piece of segmented text information. In practical applications of the embodiments, a json object named intent may be created to store the semantic feature values; for example, when the semantic processing module 103 receives the text information "I want to listen to", it can determine that the semantic feature value of the intent is music playing, recorded as { "intent": "play_music" }. The semantic processing module 103 then scores the feature value, and when the score exceeds the threshold it sends the semantic feature value, here { "intent": "play_music" }, to the application program module 200, which displays the music playing interface.
That is, as shown in fig. 3, from the moment the user starts speaking, the voice recognition module 102, the semantic processing module 103 and the application program module 200 process the voice signal in parallel. A user generally speaks for at least 1 second, whereas the semantic processing module 103 and the application program module 200 each mostly need between tens and hundreds of milliseconds. The voice detection module 101 detects the end of the voice signal after time t1; the voice recognition module has meanwhile been sending the text information converted from the voice signal to the semantic processing module 103 in real time, which finishes its work after a further delay t2 and passes its result to the application program module 200, which finishes after a further delay t3. In practice, the total delay Td from the end of the user's speech to the application program module executing the command is Td = t1 + t2 + t3. However, the 1 second of speaking time is already enough for the semantic processing module 103 and the application program module 200 to complete their work in parallel, so t2 and t3 reduce to mere communication times between the voice recognition module, the semantic processing module 103 and the application program module 200, on the order of tens of milliseconds. The overall delay Td is therefore far shorter than the t1 + t2 + t3 delay of the prior art shown in fig. 4. Thus, in a human-computer voice interaction scenario, the method shortens the processing time of the voice signal, quickly understands the content of the user's voice signal and gives feedback, reduces the delay in the voice interaction process, and improves the user's voice interaction experience.
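To make the parallel arrangement concrete, the following is a minimal sketch, not the patent's implementation: a producer thread stands in for the voice recognition module 102 and streams text fragments into a queue, while a consumer stands in for the semantic processing module 103 and processes each fragment as it arrives, so that by the time speech ends only communication overhead remains. The names and the toy rule-based extractor are assumptions for illustration.

```python
import queue
import threading

text_fragments = queue.Queue()

def voice_recognition(segments):
    """Stand-in for module 102: emit recognized text segments in order."""
    for seg in segments:
        text_fragments.put(seg)   # forward each fragment as soon as it is recognized
    text_fragments.put(None)      # end-of-speech marker, as detected by module 101

def extract_features(fragment):
    """Toy rule-based extractor; the patent names LSTM/CNN/Transformer models."""
    if "want to listen" in fragment:
        return {"intent": "play_music"}
    if fragment == "Liu Dehua":
        return {"singer": "Liu Dehua"}
    if fragment == "Forgetting Water":
        return {"song": "Forgetting Water"}
    return {}

def semantic_processing(on_update):
    """Stand-in for module 103: consume fragments while the user is still speaking."""
    semantics = {}                       # accumulated semantic feature values
    while True:
        fragment = text_fragments.get()
        if fragment is None:             # voice signal ended
            break
        semantics.update(extract_features(fragment))
        on_update(dict(semantics))       # push the partial result to module 200

producer = threading.Thread(
    target=voice_recognition,
    args=(["I want to listen to", "Liu Dehua", "Forgetting Water"],))
producer.start()
semantic_processing(on_update=print)     # the application module just prints here
producer.join()
```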
An incremental semantic processing method according to the present application is described below with reference to the accompanying drawings, and fig. 2 shows a flowchart of an incremental semantic processing method, where the method is applied to an electronic device, the electronic device includes a voice detection module, a voice recognition module, and a semantic processing module, and as shown in fig. 2, the method includes:
step S210, when the voice detection module detects the voice signal, the voice recognition module is notified to process the voice signal. That is, the voice detection module detects the voice signal first, and when the user is judged to speak, the voice recognition module is immediately informed to process the voice signal.
In step S220, the voice recognition module converts the received voice signal into segmented text information in real time, in chronological order. That is, the voice recognition module processes the voice signal in the order it arrives and converts it into text information in real time, typically as segmented text information. For example, when a user says "I want to listen to Liu Dehua's Forgetting Water", in some embodiments of the present application the finest-granularity result of the conversion is the sequence of single Chinese characters "I" (我), "want" (想), "listen" (听), "Liu" (刘), "De" (德), "Hua" (华), "forget" (忘), "emotion" (情), "water" (水).
Step S230: the voice recognition module sends the text information of each voice signal segment to the semantic processing module in chronological order, and the semantic processing module receives the text information of each segment in real time; whenever the semantic processing module receives a piece of text information, it extracts the semantic feature values of that text information, so that the voice signal is processed in parallel with the voice recognition module. That is, step S220 yields segmented text information, the voice recognition module sends each piece to the semantic processing module in chronological order, and the semantic processing module extracts the semantic feature values of the text information. For example, when the semantic processing module receives "I want to listen to" in real time, it can extract the semantic feature value play_music. In practical applications of the embodiments, a json object named intent is created to store the semantic feature values, and this semantic purpose is identified as { "intent": "play_music" }.
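As a small illustration of the json object described above, the sketch below stores the first semantic feature value under the intent field; the field name follows the example in this description, and nothing beyond that is prescribed by the method.

```python
import json

# First semantic feature value, extracted from the fragment "I want to listen to".
intent = {"intent": "play_music"}
print(json.dumps(intent))   # -> {"intent": "play_music"}
```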
In some embodiments of the present application, the method further comprises:
step S240, the semantic processing module sequentially accumulates the semantic feature values corresponding to each section of text information according to a time sequence, and scores the semantic feature values obtained for the first time or the semantic feature values accumulated based on the semantic feature values obtained for the first time, and when the score of the semantic feature value or the semantic feature value accumulated is greater than a threshold value, the semantic processing module sends the semantic feature value or the semantic feature value accumulated to the application program module. That is, further, when the semantic processing module continues to receive the "liu de hua" text message, the semantic feature value "singer" may be extracted as "liu de hua", and added to the above semantics, and the semantic feature value may be expressed as { "intent": "play _ music", "singer": the method includes the steps of marking by Liu Dehua, further, when a semantic processing module continues to receive text information of 'forgetting to do water', extracting a semantic feature value 'song' as 'forgetting to do water', accumulating the semantic feature value 'song' into the semantic meaning, and using { 'intent': "play _ music", "singer": "Liu De Hua", "song": "forget to water" }.
Each time the semantic processing module extracts a semantic feature value, it may score the accumulated semantics. For example, when the text information received by the semantic processing module is "I want to listen to", the extracted semantic feature value is { "intent": "play_music" }; this semantics is still rather incomplete, so its score does not exceed the threshold and it is not passed to the application program module. When the semantic processing module receives the "Liu Dehua" text information and extracts "singer" = "Liu Dehua", the score of { "intent": "play_music", "singer": "Liu Dehua" } exceeds the threshold, and at this point the semantic processing module sends { "intent": "play_music", "singer": "Liu Dehua" } to the application program module.
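The description does not fix a scoring formula, so the sketch below assumes a simple completeness score, namely the number of filled key values, with an assumed threshold of two; any trained scorer could take its place.

```python
THRESHOLD = 2   # assumed value; the method only requires a score above a threshold

def score(semantics):
    return len(semantics)   # 1 = bare intent, 2+ = intent plus narrowing keys

def maybe_send(semantics, send):
    if score(semantics) >= THRESHOLD:
        send(semantics)     # release to the application program module

maybe_send({"intent": "play_music"}, print)                         # withheld
maybe_send({"intent": "play_music", "singer": "Liu Dehua"}, print)  # sent
```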
The application program module receives the semantic feature values corresponding to each voice signal segment in real time and adjusts its output content according to each newly received semantic feature value. That is, when the application program module receives the semantic feature values { "intent": "play_music", "singer": "Liu Dehua" }, it displays on the application interface an interface preparing to play Liu Dehua's songs.
In some embodiments of the present application, the receiving, by the application program module, a semantic feature value corresponding to each segment of the speech signal in real time, and adjusting the output content according to the received semantic feature value each time includes:
after receiving the semantic feature values, the application program module obtains commands corresponding to the semantic feature values according to semantic feature value matching, and switches to an application interface of the commands to be executed so as to output interface contents corresponding to the semantic feature values through the application interface. Wherein, when the semantic feature value received by the application program module is { "intent": "play _ music", "singer": when the song is played, the application module displays the interface of the song to be played in the application interface according to the semantic, but the song name is not available, so that the song to be played is the other song of the song to be played in the Liu De Hua. When the semantic processing module continues to extract the semantic feature value "song" - "forgetting water", obtaining { "intent": "play _ music", "singer": "Liu De Hua", "song": and the score of the semantics exceeds a threshold value, and when the score is sent to the application program module, the application program module changes the interface for playing the music from any song for playing Liudebua to the forgetting water for playing Liudebua.
Step S250: when the voice detection module detects that the voice signal has ended, the application program module executes the command corresponding to the semantic feature values of the complete voice signal, where the complete voice signal is the voice signal within the period from when the voice detection module first detects the voice signal until the voice signal ends. That is, the application program module starts executing the command only when the voice detection module detects the end of the voice signal. For example, by step S250 the interface of the application program module is already on the playing interface for Liu Dehua's Forgetting Water; when the user stops speaking and the voice detection module detects the end of the voice signal, the application program module starts playing Liu Dehua's Forgetting Water.
In some embodiments of the present application, the semantic feature value includes a unique primary key value, the primary key value identifying a first intent of the speech signal. For example, in the above-described embodiment, "intent" is a primary key value of the semantic feature value, and thus, it can be known that the user wants to listen to music.
In some embodiments of the present application, the semantic feature values may also include N secondary key values (N ≥ 0), used to further define the range of the first intent of the primary key value. For example, in the above-described embodiment, "singer" = "Liu Dehua" and "song" = "Forgetting Water" are secondary key values of the semantic feature value, from which it can be known that the user specifically wants to listen to Liu Dehua's song Forgetting Water.
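Putting the primary and secondary key values together, a complete semantic feature value from the example has the shape below, stored as json per the embodiment above; the field names come from this description, not from a mandated schema.

```python
import json

feature_value = {
    "intent": "play_music",       # unique primary key value: the first intent
    "singer": "Liu Dehua",        # secondary key value 1 (narrows the intent)
    "song": "Forgetting Water",   # secondary key value 2 (N = 2 here)
}
print(json.dumps(feature_value))
```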
In some embodiments of the present application, the semantic processing module employs any one of an LSTM, CNN or Transformer model. These are mature neural models that are easy to train, and they allow the semantic processing module to extract semantic feature values quickly and accurately.
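As one possible realization of the LSTM option, with hypothetical sizes and label set and trained weights omitted, a minimal PyTorch sketch of an intent classifier that can be re-run as each new token arrives:

```python
import torch
import torch.nn as nn

class IntentLSTM(nn.Module):
    """Classify the (partial) utterance into one of num_intents intents."""
    def __init__(self, vocab_size=5000, embed_dim=64, hidden_dim=128, num_intents=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_intents)

    def forward(self, token_ids):
        emb = self.embed(token_ids)    # (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(emb)     # h: (1, batch, hidden_dim), last hidden state
        return self.head(h[-1])        # intent logits for the utterance so far

model = IntentLSTM()
tokens = torch.tensor([[12, 7, 431]])  # made-up token ids, e.g. "I / want-to / listen"
logits = model(tokens)                 # recomputed as each new token arrives
print(logits.argmax(dim=-1))           # predicted intent id
```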
Therefore, the incremental semantic processing method of the present application runs the voice recognition module, the semantic processing module and the application program module in parallel, shortening the processing time of the voice signal, quickly understanding the content of the user's voice signal and giving feedback, reducing the delay in the voice interaction process, and improving the user's voice interaction experience.
It is noted that, in the examples and description of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Likewise, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, or article that comprises a list of elements does not include only those elements, but may include other elements not expressly listed or inherent to such a process, method, or article. Without further limitation, an element introduced by "comprises a" does not exclude the presence of additional identical elements in the process, method, or article that comprises the element.
The foregoing describes preferred embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principles described herein, and such improvements and refinements shall also fall within the protection scope of the present application.

Claims (10)

1. An incremental semantic processing method is applied to an electronic device, and is characterized in that the electronic device comprises a voice detection module, a voice recognition module and a semantic processing module, and the method comprises the following steps:
when the voice detection module detects a voice signal, informing the voice recognition module to process the voice signal;
the voice recognition module converts the voice signals received according to the time sequence into segmented text information in real time;
the voice recognition module sequentially sends the text information of each section of the voice signals to the semantic processing module according to the time sequence, the semantic processing module receives the text information of each section of the voice signals in real time, and when the semantic processing module receives the text information of each section of the voice signals, the semantic feature value of the text information of each section of the voice signals is extracted, so that the voice signals are processed in parallel with the voice recognition module.
2. The method of claim 1, further comprising:
the semantic processing module sequentially accumulates the semantic feature values corresponding to each section of text information according to a time sequence, scores the semantic feature values obtained for the first time or scores the semantic feature values accumulated based on the semantic feature values obtained for the first time, and sends the semantic feature values or the accumulated semantic feature values to the application program module when the scores of the semantic feature values or the accumulated semantic feature values are larger than a threshold value;
and the application program module receives the semantic feature value corresponding to each section of voice signal in real time and adjusts the output content according to the received semantic feature value each time.
3. The method of claim 2, wherein the receiving, by the application module, the semantic feature value corresponding to each segment of the speech signal in real time, and adjusting the output content according to the semantic feature value received each time comprises:
and after receiving the semantic feature values, the application program module obtains a command corresponding to the semantic feature values according to semantic feature value matching, and switches to an application interface for executing the command so as to output interface contents corresponding to the semantic feature values through the application interface.
4. The method according to any of claims 1-3, wherein when the voice detection module detects the end of the voice signal, the application program module executes a command corresponding to the semantic feature values of the complete voice signal, the complete voice signal being the voice signal within the period from when the voice detection module first detects the voice signal until the voice signal ends.
5. The method of claim 1, wherein the semantic feature value comprises a unique primary key value, and wherein the primary key value is used to identify the first intent of the speech signal.
6. The method of claim 5, wherein the semantic feature value further comprises N secondary key values (N ≧ 0) used to further define the range of the first intent of the primary key value.
7. The method of claim 6, wherein the primary key value and the secondary key values of the semantic feature value are updated in real time according to the voice signal.
8. The method of claim 1, wherein the finest-granularity unit of the text information is a single Chinese character or a single word.
9. The method of claim 1, wherein the storage type of the semantic feature value is a json data type.
10. The method of claim 1, wherein the semantic processing module employs any one of an LSTM, CNN or Transformer model.
CN202010787458.1A (priority 2020-08-07, filed 2020-08-07) Incremental semantic processing method, Pending, published as CN111862980A

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010787458.1A | 2020-08-07 | 2020-08-07 | Incremental semantic processing method


Publications (1)

Publication Number | Publication Date
CN111862980A | 2020-10-30

Family

ID=72971028

Family Applications (1)

Application Number | Status | Title
CN202010787458.1A | Pending | Incremental semantic processing method

Country Status (1)

Country | Publication
CN | CN111862980A

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700769A * 2020-12-26 2021-04-23 iFLYTEK Co Ltd Semantic understanding method, device, equipment and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0543329A2 (en) * 1991-11-18 1993-05-26 Kabushiki Kaisha Toshiba Speech dialogue system for facilitating human-computer interaction
JP2001013992A (en) * 1999-07-02 2001-01-19 Nec Corp Voice understanding device
US20050114132A1 (en) * 2003-11-21 2005-05-26 Acer Inc. Voice interactive method and system
WO2017084360A1 * 2015-11-17 2017-05-26 Le Holdings (Beijing) Co Ltd Method and system for speech recognition
CN107305541A * 2016-04-20 2017-10-31 iFLYTEK Co Ltd Speech recognition text segmentation method and device
CN107665706A * 2016-07-29 2018-02-06 iFLYTEK Co Ltd Rapid speech interaction method and system
US10102851B1 (en) * 2013-08-28 2018-10-16 Amazon Technologies, Inc. Incremental utterance processing and semantic stability determination
US20180308487A1 (en) * 2017-04-21 2018-10-25 Go-Vivace Inc. Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response


Similar Documents

Publication | Title
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111326154B (en) Voice interaction method and device, storage medium and electronic equipment
CN101447185A (en) Audio frequency rapid classification method based on content
CN111161726B (en) Intelligent voice interaction method, device, medium and system
CN110673821B (en) Intelligent device awakening feedback method and intelligent device
CN108766431B (en) Automatic awakening method based on voice recognition and electronic equipment
CN111583912A (en) Voice endpoint detection method and device and electronic equipment
CN111261151A (en) Voice processing method and device, electronic equipment and storage medium
CN108595406B (en) User state reminding method and device, electronic equipment and storage medium
CN111081254B (en) Voice recognition method and device
CN111862980A (en) Incremental semantic processing method
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN111048068B (en) Voice wake-up method, device and system and electronic equipment
CN113593565B (en) Intelligent home device management and control method and system
CN114372476B (en) Semantic truncation detection method, device, equipment and computer readable storage medium
CN110222161B (en) Intelligent response method and device for conversation robot
CN113012687B (en) Information interaction method and device and electronic equipment
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN113113013A (en) Intelligent voice interaction interruption processing method, device and system
CN111985248A (en) Information interaction method and device
CN111508481A (en) Training method and device of voice awakening model, electronic equipment and storage medium
CN113420115B (en) Information processing method and device based on man-machine voice dialogue
CN110704585A (en) Question answering method, device and computer readable medium
CN117894321A (en) Voice interaction method, voice interaction prompting system and device
CN112201250B (en) Semantic analysis method and device, electronic equipment and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination