CN112765335A

CN112765335A - Voice calling landing system

Info

Publication number: CN112765335A
Application number: CN202110107551.8A
Authority: CN
Inventors: 金鑫嘉; 何晓光; 翁彬; 刘德辉; 姚飞
Original assignee: Shanghai Mitsubishi Elevator Co Ltd
Current assignee: Shanghai Mitsubishi Elevator Co Ltd
Priority date: 2021-01-27
Filing date: 2021-01-27
Publication date: 2021-05-07
Anticipated expiration: 2041-01-27
Also published as: CN112765335B

Abstract

The invention discloses a voice calling landing system. A voice acquisition device acquires an audio signal and sends it to a voice recognition device. The voice recognition device judges that an effective call instruction keyword exists in the audio signal when a call instruction keyword is present and the maximum amplitude of that keyword in the audio signal is greater than a trigger threshold. If an effective call instruction keyword exists and the audio signal amplitudes within a first set time before and after the keyword are all smaller than a voice threshold, the corresponding call instruction is sent to a microprocessor; otherwise, no call instruction is sent to the microprocessor. The voice threshold is less than or equal to the trigger threshold. The invention also discloses two further voice call systems. The voice call systems provided by the invention can avoid false triggering of call instructions.

Description

Voice calling landing system

Technical Field

The invention relates to elevators, and in particular to a voice call system.

Background

To reduce the probability of false triggering, the voice calling device of an elevator treats a keyword as a valid command only when its voice volume reaches a certain threshold. However, because noise levels differ from site to site, a fixed threshold cannot suit every situation: a threshold set under quiet conditions still allows false triggering in noisier conditions, while a threshold set in a noisy environment may fail to trigger a correct command in a quiet one, as shown in fig. 1.

Triggering is normal when there is no voice input before or after the keyword, as shown in fig. 2, but false triggering may occur when there is voice input before or after the keyword, as shown in fig. 3. If people chatting near the device use sentences that contain keyword information, the voice device can capture those keywords and trigger falsely. For example, in an elevator voice call device, the control-command keywords are generally "go up/down" or "I go up/down". Because these keywords are short, differ only in "up" versus "down", and the two words sound similar, conversations at the call site frequently cause false triggering (for example, speech about going upstairs or downstairs being recognized as an up call). In an elevator application, "false triggering" is a worse outcome than "no response".

Disclosure of Invention

The invention aims to provide a voice call system that avoids false triggering of call instructions.

To solve this technical problem, the invention provides a voice call system comprising a voice acquisition device, a voice recognition device and a microprocessor;

the voice acquisition device acquires audio signals and sends them to the voice recognition device;

the voice recognition device judges that an effective call instruction keyword exists in the audio signal when a call instruction keyword is present in the audio signal and the maximum amplitude of that keyword in the audio signal is greater than a trigger threshold;

if an effective call instruction keyword exists in the audio signal and the audio signal amplitudes within a first set time before and after the keyword are all smaller than a voice threshold, the voice recognition device sends the corresponding call instruction to the microprocessor; otherwise, it sends no call instruction to the microprocessor;

the voice threshold is less than or equal to the trigger threshold.
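The decision rule of this first system can be sketched as below. This is a minimal illustration under stated assumptions, not the patented implementation: the recognizer is assumed to report the keyword together with its sample span (the hypothetical `KeywordHit`), amplitude is taken as the absolute sample value, and the first set time is expressed as a number of samples (`guard`).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class KeywordHit:
    """Hypothetical recognizer output: the keyword and its sample span."""
    keyword: str
    start: int  # index of the first sample of the keyword
    end: int    # index one past the last sample of the keyword


def call_instruction(samples: Sequence[float], hit: Optional[KeywordHit],
                     trigger_threshold: float, voice_threshold: float,
                     guard: int) -> Optional[str]:
    """Return the keyword to forward to the microprocessor, or None.

    guard is the 'first set time' expressed in samples (an assumption).
    """
    if voice_threshold > trigger_threshold:
        raise ValueError("voice threshold must not exceed trigger threshold")
    if hit is None:
        return None  # no call instruction keyword recognized
    # Effective keyword: its peak amplitude must exceed the trigger threshold.
    peak = max((abs(x) for x in samples[hit.start:hit.end]), default=0.0)
    if peak <= trigger_threshold:
        return None
    # Quiet guard windows: every sample within `guard` samples before and
    # after the keyword must stay below the voice threshold. Samples outside
    # the buffer are simply not checked in this sketch.
    before = samples[max(0, hit.start - guard):hit.start]
    after = samples[hit.end:hit.end + guard]
    if any(abs(x) >= voice_threshold for x in (*before, *after)):
        return None
    return hit.keyword
```

A real device would have to wait until the guard window after the keyword has been collected before deciding; the sketch assumes the whole buffer is already available.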

Preferably, the voice recognition device receives the audio signal collected by the voice acquisition device, takes the mean amplitude of the audio signal received during the preceding second set time as the ambient volume mean, and adds a fixed offset volume to the ambient volume mean to obtain the current trigger threshold; the second set time is longer than the first set time;

when the voice recognition device recognizes that a call instruction keyword exists in the audio signal and the amplitude of the audio signal of that keyword is greater than the current trigger threshold, it judges that an effective call instruction keyword exists in the audio signal.

Preferably, the second set time is between 5 seconds and 30 minutes.

Preferably, the voice threshold is less than or equal to 1/4 of the trigger threshold.
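The dynamic trigger threshold of this preferred variant can be sketched with a sliding window. Assumptions: amplitude readings arrive at a fixed rate, so the second set time maps to a fixed number of readings (`window`); the class name `DynamicTrigger` and the choice to derive the voice threshold as a fixed fraction (by default exactly 1/4, the upper bound stated above) of the current trigger threshold are illustrative, not taken from the source.

```python
from collections import deque


class DynamicTrigger:
    """Ambient-volume tracker deriving the current trigger threshold.

    window: number of amplitude readings covering the second set time
    (assumed fixed sampling rate); offset: the fixed offset volume.
    """

    def __init__(self, window: int, offset: float, voice_ratio: float = 0.25):
        if window <= 0:
            raise ValueError("window must be positive")
        if not 0.0 < voice_ratio <= 0.25:
            raise ValueError("voice threshold must be at most 1/4 of trigger")
        self.history: deque = deque(maxlen=window)  # oldest reading drops out
        self.offset = offset
        self.voice_ratio = voice_ratio

    def update(self, amplitude: float) -> None:
        self.history.append(abs(amplitude))

    @property
    def ambient_mean(self) -> float:
        # Mean amplitude over the most recent `window` readings.
        return sum(self.history) / len(self.history) if self.history else 0.0

    @property
    def trigger_threshold(self) -> float:
        # Ambient volume mean plus the fixed offset volume.
        return self.ambient_mean + self.offset

    @property
    def voice_threshold(self) -> float:
        return self.voice_ratio * self.trigger_threshold
```

Because the window holds only recent readings, the threshold rises in a noisy lobby and falls again when the site goes quiet. The source does not say whether keyword audio itself should be excluded from the ambient mean, so the sketch simply includes every reading.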

To solve the same technical problem, the invention provides a second voice call system, which likewise comprises a voice acquisition device, a voice recognition device and a microprocessor;

the voice recognition device performs recognition on the audio signal to obtain n keyword feature scores, where n is the number of keywords the voice recognition device can recognize and n is an integer greater than 1; each keyword feature score represents the confidence for the corresponding keyword, and the higher the score, the higher the probability that the audio signal is recognized as that keyword;

when the highest of the n keyword feature scores is greater than a first set feature threshold and the difference between the two highest keyword feature scores is greater than or equal to a second set feature threshold, the keyword with the highest feature score is taken as the recognized effective call instruction keyword; the second set feature threshold is smaller than the first set feature threshold;

when an effective call instruction keyword is recognized from the audio signal, the voice recognition device sends the corresponding call instruction to the microprocessor; otherwise, no call instruction is sent to the microprocessor.
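The two-threshold scoring rule can be sketched as follows, assuming the recognizer has already produced a mapping from each of the n keywords to its feature score; the function name, keyword strings and score scale are hypothetical.

```python
from typing import Mapping, Optional


def pick_keyword(scores: Mapping[str, float], first_threshold: float,
                 second_threshold: float) -> Optional[str]:
    """Return the effective call instruction keyword, or None."""
    if len(scores) < 2:
        raise ValueError("n must be greater than 1")
    if second_threshold >= first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    # Rank keywords by feature score, highest first.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_keyword, best = ranked[0]
    runner_up = ranked[1][1]
    if best <= first_threshold:
        return None  # not confident enough in any keyword
    if best - runner_up < second_threshold:
        return None  # top two too close, e.g. similar-sounding keywords
    return best_keyword
```

With illustrative thresholds of 0.6 and 0.2, scores of 0.9 and 0.85 for the two direction keywords are rejected even though both exceed 0.6, which is exactly the similar-pronunciation case the margin test targets.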

Preferably, the voice recognition device is provided with a call instruction keyword sample set;

the call instruction keyword sample set comprises samples of n call instruction keywords;

the voice recognition device computes the n call instruction keyword feature scores from the degree to which the audio signal matches each of the n call instruction keyword samples in the sample set.
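The source does not specify how the matching degree is computed. As a stand-in only, the sketch below assumes that the audio signal and each keyword sample have already been reduced to fixed-length feature vectors, and uses cosine similarity as the matching degree, giving one score per sample. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("feature vectors must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as matching nothing.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_scores(features: Sequence[float],
                   sample_set: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """One feature score per keyword sample (n samples give n scores)."""
    return {kw: cosine_similarity(features, ref) for kw, ref in sample_set.items()}
```

Practical keyword spotters usually score keywords with trained acoustic models rather than a single vector comparison; only the shape of the output (n scores from n samples) matters for the decision rules in this document.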

A third voice call system provided by the invention likewise comprises a voice acquisition device, a voice recognition device and a microprocessor; the voice recognition device performs recognition on the audio signal to obtain k keyword feature scores, where k is the number of keywords the voice recognition device can recognize and k is an integer greater than 1; each keyword feature score represents the confidence for the corresponding keyword, and the higher the score, the higher the probability that the audio signal is recognized as that keyword;

when the keyword with the highest feature score is not a junk keyword, it is taken as the recognized effective call instruction keyword, and the corresponding call instruction is sent to the microprocessor;

when the keyword with the highest feature score is a junk keyword, no call instruction is sent to the microprocessor.

Preferably, the voice recognition device is provided with a call instruction keyword sample set; the sample set comprises samples of i standard call instruction keywords and j junk keywords, where i and j are positive integers and k = i + j;

the voice recognition device computes the k keyword feature scores from the degree to which the audio signal matches each of the k keyword samples in the sample set.

Preferably, when the keyword with the highest feature score is not a junk keyword, it is taken as the recognized effective call instruction keyword, and the corresponding call instruction is sent to the microprocessor, only if its feature score is greater than the first set feature threshold; otherwise, no call instruction is sent to the microprocessor.
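A sketch of the junk-keyword rule, including the optional first set feature threshold from the preferred variant. The keyword strings are placeholders based on the landing example of embodiment six (standard "going upstairs"/"going downstairs", junk "up"/"down"), and the function name is hypothetical.

```python
from typing import Collection, Mapping, Optional


def junk_filtered_keyword(scores: Mapping[str, float],
                          junk_keywords: Collection[str],
                          first_threshold: Optional[float] = None) -> Optional[str]:
    """Return the call instruction keyword, or None if no call is issued."""
    if len(scores) < 2:
        raise ValueError("k must be greater than 1")
    best = max(scores, key=scores.__getitem__)  # highest feature score
    if best in junk_keywords:
        return None  # best match is a non-standard phrase: no call
    if first_threshold is not None and scores[best] <= first_threshold:
        return None  # optional confidence check from the preferred variant
    return best
```

Including junk samples in the same sample set lets a non-standard utterance "win" the comparison and be discarded, instead of being forced onto the nearest standard keyword.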

Preferably, after receiving a call instruction, the microprocessor sends information related to the call instruction to an elevator controller;

the elevator controller controls the operation of the elevator according to the call instruction information and the running state of the elevator.

Preferably, the voice calling system is a landing voice calling system;

the voice acquisition device is used for acquiring a landing audio signal and sending the landing audio signal to the voice recognition device;

the call instruction keywords include "up" and "down".

Preferably, the voice calling system is an in-car voice calling system;

the voice acquisition device is used for acquiring audio signals in the car and sending the audio signals to the voice recognition device;

the call instruction keywords comprise at least one of "* floor", "door open" and "door close", where * is a floor number.
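The microprocessor-side mapping from recognized keywords to instructions for the two deployments can be sketched as below. Everything here is hypothetical: the `CallInstruction` fields, the English placeholder phrases, and the "floor N" pattern standing in for "* floor". The source only states that the microprocessor forwards call-related information to the elevator controller, which then acts according to the elevator running state.

```python
import re
from dataclasses import dataclass
from typing import Optional

FLOOR_PATTERN = re.compile(r"floor (\d+)")  # placeholder for "* floor"


@dataclass(frozen=True)
class CallInstruction:
    kind: str                       # "hall_call", "car_call", "door_open", "door_close"
    floor: Optional[int] = None
    direction: Optional[str] = None


def landing_instruction(keyword: str, landing_floor: int) -> Optional[CallInstruction]:
    """Landing system: direction keywords become a hall call at this floor."""
    direction = {"going upstairs": "up", "going downstairs": "down"}.get(keyword)
    if direction is None:
        return None
    return CallInstruction("hall_call", landing_floor, direction)


def car_instruction(keyword: str) -> Optional[CallInstruction]:
    """In-car system: "* floor", door open and door close keywords."""
    match = FLOOR_PATTERN.fullmatch(keyword)
    if match:
        return CallInstruction("car_call", int(match.group(1)))
    if keyword == "door open":
        return CallInstruction("door_open")
    if keyword == "door close":
        return CallInstruction("door_close")
    return None
```

Whether a hall call is actually served immediately is left to the elevator controller, which is why the sketch only builds the instruction and does not schedule anything.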

The voice call system can effectively avoid the false triggering of the call instruction.

Drawings

To illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of voice call system triggering with a fixed trigger threshold;

FIG. 2 is a schematic diagram of normal triggering of a voice call system without voice input before and after a keyword;

FIG. 3 is a schematic diagram of a voice call system false triggering with voice input before and after a keyword;

FIG. 4 is a schematic diagram of a voice call system architecture;

FIG. 5 is a schematic illustration of the operation of the voice recognition device of a first embodiment of the voice call system of the present invention;

FIG. 6 is a schematic diagram of normal triggering when there is no voice input within a period before and after a call instruction keyword;

FIG. 7 is a schematic diagram of a call instruction keyword not being triggered when there is voice input within a period before and after it;

FIG. 8 is a schematic diagram of how the trigger threshold of the voice call system of the present invention changes dynamically;

FIG. 9 is a schematic illustration of the operation of the voice recognition device of the second embodiment of the voice call system of the present invention;

FIG. 10 is a schematic illustration of the operation of the voice recognition device of a third embodiment of the voice call system of the present invention;

FIG. 11 is a schematic illustration of the operation of the voice recognition device of a fourth embodiment of the voice call system of the present invention.

Detailed Description

The technical solutions in the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

As shown in fig. 4, the voice call system includes a voice acquisition device (MIC), a voice recognition device and a microprocessor (MCU);

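as shown in fig. 5, when a call instruction keyword is identified in the audio signal and the maximum amplitude of that keyword in the audio signal is greater than the trigger threshold, the voice recognition device determines that an effective call instruction keyword exists in the audio signal; if an effective call instruction keyword exists and the audio signal amplitudes within the first set time before and after it are all smaller than the voice threshold, the corresponding call instruction is sent to the microprocessor; otherwise, no call instruction is sent to the microprocessor;

the voice threshold is less than or equal to the trigger threshold.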

In the voice call system of the first embodiment, when there is no voice input before and after the call instruction keyword (the audio signal amplitude is smaller than the voice threshold), as shown in fig. 6, it is determined that an effective call instruction keyword exists in the audio signal, and the corresponding call instruction is sent to the microprocessor. When there is voice input before or after the call instruction keyword (the audio signal amplitude reaches or exceeds the voice threshold), as shown in fig. 7, it is determined that no effective call instruction keyword exists in the audio signal, and no call instruction is sent to the microprocessor. The first embodiment thus adds "no voice input before and after the keyword" to the triggering conditions: after a call instruction keyword is captured, the device first checks whether there was voice input within a period before and after it, and triggers the call function only after confirming there was none. This stricter condition effectively prevents false triggering caused by keywords captured from within a sentence.

Example two

Based on the voice call system of the first embodiment, the voice recognition device receives the audio signal acquired by the voice acquisition device, takes the mean amplitude of the audio signal received during the preceding second set time as the ambient volume mean, and adds a fixed offset volume to the ambient volume mean to obtain the current trigger threshold, as shown in fig. 8; the second set time is longer than the first set time;

when the voice recognition device recognizes that a call instruction keyword exists in the audio signal and the amplitude of the audio signal of that keyword is greater than the current trigger threshold, it determines that an effective call instruction keyword exists in the audio signal, as shown in fig. 9.

Preferably, the second set time is between 5 seconds and 30 minutes and is chosen according to the installation site of the elevator.

The voice call system of the second embodiment makes the trigger threshold for call instruction keywords dynamic, adjusting it according to the ambient volume mean. Voice that does not reach the trigger threshold is regarded as invalid and collection continues; when the voice collected by the voice acquisition device reaches the trigger threshold, the continuous audio signal stream containing the voice is extracted for further processing. This reduces false triggering while keeping the system adapted to different application scenarios.

Example three

As shown in fig. 4, the voice call system includes a voice acquisition device, a voice recognition device and a microprocessor;

as shown in fig. 10, the voice recognition device performs recognition on the audio signal to obtain n keyword feature scores, where n is the number of keywords the voice recognition device can recognize and n is an integer greater than 1; each keyword feature score represents the confidence for the corresponding keyword, and the higher the score, the higher the probability that the audio signal is recognized as that keyword;

The voice call system of the third embodiment adopts a scoring strategy for keywords whose pronunciations are similar and which are often confused in field use (for example, "going upstairs" recognized as "going downstairs", and vice versa). As shown in fig. 10, the voice recognition device performs recognition on the audio signal to obtain n keyword feature scores. If the highest of the n scores is greater than the first set feature threshold, the audio signal is likely to be the corresponding keyword and processing continues; otherwise, no processing is performed. The two highest keyword feature scores are then selected and their difference is calculated. If the difference does not reach the second set feature threshold, the audio signal is regarded as invalid and processing ends; if it does, the keyword with the highest feature score is taken as the recognized effective call instruction keyword and passed on for the next stage of processing. Because this scoring strategy rejects results whose keyword feature scores differ only slightly, it effectively reduces misrecognition caused by similar pronunciation and improves the accuracy of voice call instruction recognition.

Example four

as shown in fig. 11, the voice recognition device performs recognition on the audio signal to obtain k keyword feature scores, where k is the number of keywords the voice recognition device can recognize and k is an integer greater than 1; each keyword feature score represents the confidence for the corresponding keyword, and the higher the score, the higher the probability that the audio signal is recognized as that keyword;

Standard call instruction keywords are the standard words passengers commonly use to call an elevator. Junk keywords are words that resemble the standard call instruction keywords but do not meet the standard requirements.

In the voice call system of the fourth embodiment, as shown in fig. 11, when the keyword with the highest feature score is a junk keyword, processing ends and no call instruction is generated; when it is not a junk keyword, the corresponding call instruction is generated from that keyword.

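Example five

Based on the voice call systems of the first, second, third and fourth embodiments, as shown in fig. 4, after receiving a call instruction, the microprocessor sends information related to the call instruction to the elevator controller; the elevator controller controls the operation of the elevator according to the call instruction information and the running state of the elevator.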

Example six

Based on the voice call systems of the first to fifth embodiments, the voice call system is a landing voice call system;

the call instruction keywords include "up" and "down".

In a landing voice call system, the standard call instruction keywords are generally "going upstairs/going downstairs" or "I'm going upstairs/I'm going downstairs". In actual use, however, passengers may say nonstandard words such as "up/down"; such words are classified as junk keywords.

Example seven

Based on the voice call systems of the first to fifth embodiments, the voice call system is an in-car voice call system;

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

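1. A voice calling landing system, characterized by comprising a voice acquisition device, a voice recognition device and a microprocessor;

the voice acquisition device acquires audio signals and sends them to the voice recognition device;

the voice recognition device judges that an effective call instruction keyword exists in the audio signal when a call instruction keyword is present in the audio signal and the maximum amplitude of that keyword in the audio signal is greater than a trigger threshold;

if an effective call instruction keyword exists in the audio signal and the audio signal amplitudes within a first set time before and after the keyword are all smaller than a voice threshold, the voice recognition device sends the corresponding call instruction to the microprocessor; otherwise, it sends no call instruction to the microprocessor;

the voice threshold is less than or equal to the trigger threshold.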

2. Voice call system according to claim 1,

the voice recognition device receives the audio signals collected by the voice acquisition device, takes the mean amplitude of the audio signals received during the preceding second set time as the ambient volume mean, and adds a fixed offset volume to the ambient volume mean to obtain the current trigger threshold; the second set time is longer than the first set time;

3. Voice call system according to claim 2,

the second set time is between 5 seconds and 30 minutes.

4. Voice call system according to claim 1,

the voice threshold is less than or equal to 1/4 of the trigger threshold.

5. Voice call system according to claim 1,

after receiving a call instruction, the microprocessor sends information related to the call instruction to an elevator controller;

6. Voice call system according to claim 1,

the voice calling system is a landing voice calling system;

the call instruction keywords include "up" and "down".

7. Voice call system according to claim 1,

the voice calling system is a voice calling system in the car;

8. A voice calling landing system is characterized by comprising a voice acquisition device, a voice recognition device and a microprocessor;

9. Voice call system according to claim 8,

the voice recognition device is provided with a call instruction keyword sample set;

10. Voice call system according to claim 8,

11. Voice call system according to claim 8,

the voice calling system is a landing voice calling system;

the call instruction keywords include "up" and "down".

12. Voice call system according to claim 8,

the voice calling system is a voice calling system in the car;

13. A voice calling landing system is characterized by comprising a voice acquisition device, a voice recognition device and a microprocessor;

14. Voice call system according to claim 13,

15. Voice call system according to claim 13,

when the keyword with the highest feature score is not a junk keyword, it is taken as the recognized effective call instruction keyword, and the corresponding call instruction is sent to the microprocessor, only if its feature score is greater than a first set feature threshold; otherwise, no call instruction is sent to the microprocessor.

16. Voice call system according to claim 13,

17. Voice call system according to claim 13,

the voice calling system is a landing voice calling system;

the call instruction keywords include "up" and "down".

18. Voice call system according to claim 13,

the voice calling system is a voice calling system in the car;