CN114937453A - Voice recognition method, system, equipment and storage medium based on hotel scene - Google Patents


Info

Publication number
CN114937453A
Authority
CN
China
Prior art keywords: hotel, audio, voiceprint, environment audio, staff
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210539264.9A
Other languages
Chinese (zh)
Inventor
叶帅
刘晓雷
王长春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Huake Information Technology Co ltd
Original Assignee
Shanghai Huake Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Huake Information Technology Co ltd filed Critical Shanghai Huake Information Technology Co ltd
Priority to CN202210539264.9A priority Critical patent/CN114937453A/en
Publication of CN114937453A publication Critical patent/CN114937453A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques for estimating an emotional state

Abstract

The invention provides a speech recognition method, system, device, and storage medium for hotel scenarios. The method comprises the following steps: establishing a voiceprint feature library of hotel employees; deploying first sound-pickup devices at preset hotel locations to capture first ambient audio; extracting voiceprint features from the first ambient audio and, when no hotel employee's voiceprint feature is matched, dispatching a nearby hotel employee as the event-service employee to the area of the first sound-pickup device; when the event-service employee's voiceprint feature is subsequently detected in the ambient audio, activating the wireless recording module of that employee's smart wearable device, which captures second ambient audio and sends it to the hotel server; and performing voiceprint-based speech recognition on the second ambient audio to generate an event dialog text. The invention enables comprehensive, real-time voice monitoring and analysis throughout a hotel and greatly improves speaker identification and separation, recognition accuracy, and recall.

Description

Voice recognition method, system, equipment and storage medium based on hotel scene
Technical Field
The invention relates to the field of hotel management, and in particular to a speech recognition method, system, device, and storage medium for hotel scenarios.
Background
Existing intelligent sound-pickup products on the market include smart sound pickups, smart badges, and the like. A sound pickup can capture audio in an open scene, and a recording badge can record the voice of an individual service employee for quality inspection. However, no product currently on the market records and analyzes service quality in open scenes. Moreover, the sound pickups and smart badges on the market have the following problems:
(1) Existing sound pickups must connect to Wi-Fi and can only be used where a wireless network is available, which severely limits their usage scenarios.
(2) Smart badges, limited by their size, upload recordings offline: they record for several hours and then upload while charging. This makes real-time analysis impossible.
(3) Existing sound pickups generally record continuously, so the volume of recordings grows very large over time and wastes storage. Transcribing every recording also incurs significant cost. Smart badges share this cost problem and, because they must be recharged, cannot record without interruption. Moreover, a guest's emotional outburst often goes unaddressed for a long time and may escalate into a quarrel, yet a smart badge's limited battery prevents it from being kept on at all times to record and analyze the cause.
(4) Recordings from existing sound pickups are generally only stored; they cannot be subjected to further analysis such as comprehensive emotion analysis or service quality inspection.
These are the difficulties and pain points of digitally transforming hotel services.
The invention therefore provides a speech recognition method, system, device, and storage medium for hotel scenarios.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a speech recognition method, system, device, and storage medium for hotel scenarios that overcome these difficulties: they enable comprehensive, real-time voice monitoring and analysis throughout a hotel, identify and separate speakers in noisy environments, attribute analysis results to specific persons, greatly reduce labor cost, and greatly improve recognition accuracy and recall.
An embodiment of the invention provides a speech recognition method for hotel scenarios, comprising the following steps:
establishing a voiceprint feature library of hotel employees;
deploying a first sound-pickup device at each preset hotel location to capture first ambient audio;
extracting voiceprint features from the first ambient audio and, when no hotel employee's voiceprint feature is matched, dispatching a nearby hotel employee as the event-service employee to the area of the first sound-pickup device;
when the event-service employee's voiceprint feature is detected in the ambient audio, activating the wireless recording module of that employee's smart wearable device, which captures second ambient audio and sends it to the hotel server; and
performing voiceprint-based speech recognition on the second ambient audio to generate an event dialog text.
Preferably, establishing the voiceprint feature library of hotel employees comprises:
collecting the voiceprint features of each hotel employee; and
establishing a first mapping table between the serial number of each hotel employee's smart wearable device and that employee's voiceprint features.
Preferably, deploying a first sound-pickup device at each preset hotel location to capture first ambient audio comprises:
deploying a first sound-pickup device at each hotel location, the first sound-pickup device capturing first ambient audio; and
establishing a second mapping table between each first sound-pickup device and its location.
Preferably, extracting voiceprint features from the first ambient audio and, when no hotel employee's voiceprint feature is matched, dispatching a nearby hotel employee as the event-service employee to the area of the first sound-pickup device comprises:
the first sound-pickup device extracting the voiceprint features of the first ambient audio; and
when none of the extracted voiceprint features matches a hotel employee, selecting at least one nearby hotel employee as the event-service employee based on the location of the first sound-pickup device that captured the first ambient audio, and generating a service instruction to that employee's smart wearable device, which guides the employee to that location.
Preferably, when the event-service employee's voiceprint feature is detected in the ambient audio, activating the wireless recording module of the event-service employee's smart wearable device, capturing second ambient audio, and sending it to the hotel server comprises:
the first sound-pickup device continuing to extract voiceprint features from the first ambient audio; and
when the event-service employee's voiceprint feature is matched, sending a recording instruction to the event-service employee's smart wearable device, which captures second ambient audio and sends it to the hotel server.
Preferably, performing voiceprint-based speech recognition on the second ambient audio to generate an event dialog text comprises:
extracting the event-service employee's first speech segments from the second ambient audio based on the employee's voiceprint features, and performing speech recognition to obtain first dialog segments;
performing speech recognition on the remaining second speech segments of the second ambient audio to obtain second dialog segments;
generating a contextual dialog text between the event-service employee and the hotel guest by arranging the first and second dialog segments in their order of occurrence in the second ambient audio; and
generating a hotel service task based at least on the content of the dialog text.
Preferably, establishing the second mapping table between each first sound-pickup device and its location comprises:
establishing a second mapping table between each first sound-pickup device, at least one preset hotel service task, and the device's location;
and generating a hotel service task based at least on the content of the dialog text comprises:
inputting the content of the dialog text and the preset hotel service tasks associated with the first sound-pickup device in the second mapping table into a trained hotel task generation model;
increasing the confidence of each preset hotel service task in the hotel task generation model according to its relevance to the dialog text;
outputting the hotel service task with the highest confidence according to the confidence ranking; and
sending that hotel service task to the corresponding hotel service department.
An embodiment of the invention further provides a speech recognition system for hotel scenarios, used to implement the above speech recognition method, comprising:
a voiceprint collection module for establishing a voiceprint feature library of hotel employees;
a first ambient audio module for deploying a first sound-pickup device at each preset hotel location and capturing first ambient audio;
an event-service dispatch module for extracting voiceprint features from the first ambient audio and, when no hotel employee's voiceprint feature is matched, dispatching a nearby hotel employee as the event-service employee to the area of the first sound-pickup device;
a second ambient audio module for activating the wireless recording module of the event-service employee's smart wearable device when the employee's voiceprint feature is detected in the ambient audio, capturing second ambient audio, and sending it to the hotel server; and
an event dialog text module for performing voiceprint-based speech recognition on the second ambient audio to generate an event dialog text.
An embodiment of the invention further provides a speech recognition device for hotel scenarios, comprising:
a processor; and
a memory storing instructions executable by the processor;
wherein the processor is configured to perform the steps of the above speech recognition method by executing the instructions.
An embodiment of the invention further provides a computer-readable storage medium storing a program which, when executed, implements the steps of the above speech recognition method.
The invention provides a speech recognition method, system, device, and storage medium for hotel scenarios that enable comprehensive, real-time voice monitoring and analysis throughout a hotel, identify and separate speakers in noisy environments, attribute analysis results to specific persons, greatly reduce labor cost, and greatly improve recognition accuracy and recall.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
Fig. 1 is a flowchart of the speech recognition method for hotel scenarios of the present invention.
Figs. 2 to 7 are schematic diagrams of the implementation process of the speech recognition method for hotel scenarios.
Fig. 8 is a schematic block diagram of the speech recognition system for hotel scenarios of the present invention.
Fig. 9 is a schematic structural diagram of the speech recognition device for hotel scenarios of the present invention.
Fig. 10 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present application. It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings so that those skilled in the art to which the present application pertains can easily carry out the present application. The present application may be embodied in many different forms and is not limited to the embodiments described herein.
In the expressions of the present application, reference to expressions of "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics illustrated may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of different embodiments or examples presented in this application can be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In this application, "plurality" means two or more unless specifically defined otherwise.
In order to clearly explain the present application, components that are not related to the description are omitted, and the same reference numerals are given to the same or similar components throughout the specification.
Throughout the specification, when a device is referred to as being "connected" to another device, this includes not only the case of being "directly connected" but also the case of being "indirectly connected" with another element interposed therebetween. In addition, when a device "includes" a certain component, unless otherwise stated, the device does not exclude other components, but may include other components.
When a device is said to be "on" another device, this may be directly on the other device, but may also be accompanied by other devices in between. When a device is said to be "directly on" another device, there are no other devices in between.
Although the terms first, second, etc. may be used herein to describe various elements in some instances, these elements should not be limited by these terms. These terms are only used to distinguish one element from another; for example, a first interface and a second interface. Also, as used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes" and/or "including," when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, steps or operations is inherently mutually exclusive in some way.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the singular forms "a", "an" and "the" include plural forms as long as the words do not expressly indicate a contrary meaning. The term "comprises/comprising" when used in this specification is taken to specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of other features, regions, integers, steps, operations, elements, and/or components.
Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Terms defined in commonly used dictionaries should additionally be interpreted as having meanings consistent with related technical documents and the present disclosure, and must not be interpreted in an idealized or overly formal sense unless expressly so defined.
Fig. 1 is a flowchart of the speech recognition method for hotel scenarios of the present invention. As shown in Fig. 1, an embodiment of the invention provides a speech recognition method for hotel scenarios, comprising the following steps:
S110: establish a voiceprint feature library of hotel employees.
S120: deploy a first sound-pickup device at each preset hotel location and capture first ambient audio.
S130: extract voiceprint features from the first ambient audio and, when no hotel employee's voiceprint feature is matched, dispatch a nearby hotel employee as the event-service employee to the area of the first sound-pickup device.
S140: when the event-service employee's voiceprint feature is detected in the ambient audio, activate the wireless recording module of the employee's smart wearable device, which captures second ambient audio and sends it to the hotel server. The smart wearable device is a smart badge worn by hotel employees that provides network transmission, indoor positioning within the hotel, and audio recording and playback.
S150: perform voiceprint-based speech recognition on the second ambient audio to generate an event dialog text.
In a preferred embodiment, step S110 comprises:
S111: collect the voiceprint features of each hotel employee; and
S112: establish a first mapping table between the serial number of each hotel employee's smart wearable device and that employee's voiceprint features.
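The enrollment steps S111 and S112 amount to building a keyed store of voiceprints. A minimal sketch follows; the badge serial-number format and the short fixed-length embeddings are illustrative assumptions, since the patent does not specify a voiceprint representation:

```python
# Sketch of S111/S112: enroll each employee's voiceprint and map it to the
# serial number of their smart badge. Serial numbers and the toy 4-dim
# embeddings are hypothetical placeholders for real voiceprint vectors.

voiceprint_library = {}  # first mapping table: badge serial -> voiceprint

def register_employee(badge_serial, voiceprint_embedding):
    """S111 + S112: store the employee's voiceprint keyed by badge serial."""
    voiceprint_library[badge_serial] = voiceprint_embedding

register_employee("BADGE-0011", [0.12, -0.40, 0.88, 0.05])
register_employee("BADGE-0012", [0.70, 0.21, -0.33, 0.52])
```

In a real deployment the embedding would come from a speaker-verification model run on an enrollment recording; the dictionary stands in for the server-side library.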
In a preferred embodiment, step S120 comprises:
S121: deploy a first sound-pickup device at each hotel location, the device capturing first ambient audio; and
S122: establish a second mapping table between each first sound-pickup device and its location.
In a preferred embodiment, step S130 comprises:
S131: the first sound-pickup device extracts the voiceprint features of the first ambient audio.
S132: determine whether any extracted voiceprint feature matches a hotel employee; if none matches (an unknown speaker, i.e. a guest, is present), execute step S133; if an employee's voiceprint is matched, go to step S135.
S133: select at least one nearby hotel employee as the event-service employee based on the location of the first sound-pickup device that captured the first ambient audio.
S134: send a service instruction to the event-service employee's smart wearable device, guiding the employee to the location of the first sound-pickup device, and execute step S140.
S135: end.
In a preferred embodiment, step S140 comprises:
S141: the first sound-pickup device continues to extract voiceprint features from the first ambient audio.
S142: determine whether the event-service employee's voiceprint feature is matched; if so, execute step S143; if not, return to step S141.
S143: send a recording instruction to the event-service employee's smart wearable device.
S144: the smart wearable device captures second ambient audio and sends it to the hotel server.
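The S141-S144 loop is a small state machine: the pickup device keeps extracting voiceprints until the dispatched employee's voice appears, then triggers the badge. In the sketch below, the frame stream and the `identify` callback stand in for real audio capture and voiceprint matching and are purely illustrative:

```python
def run_trigger_loop(frames, servicer_id, identify):
    """S141-S144 sketch: scan ambient-audio frames until the dispatched
    employee's voiceprint is matched (S142), then emit the recording
    instruction for the badge (S143). `identify` maps a frame to a
    speaker id, standing in for voiceprint extraction + matching."""
    for i, frame in enumerate(frames):        # S141: keep listening
        if identify(frame) == servicer_id:    # S142: employee arrived?
            return {"cmd": "start_recording", # S143: instruction to badge
                    "badge": servicer_id,
                    "frame_index": i}
    return None  # employee never detected; keep waiting in practice

# Hypothetical frame stream: guest voices, then employee 11 arrives.
frames = ["guest", "guest", "emp-11", "emp-11"]
instruction = run_trigger_loop(frames, "emp-11", identify=lambda f: f)
```

This gating is what lets the battery-powered badge stay off until the employee is actually on the scene.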
In a preferred embodiment, step S150 comprises:
S151: extract the event-service employee's first speech segments from the second ambient audio based on the employee's voiceprint features, and perform speech recognition to obtain first dialog segments.
S152: perform speech recognition on the remaining second speech segments of the second ambient audio to obtain second dialog segments.
S153: arrange the first and second dialog segments in their order of occurrence in the second ambient audio to generate a contextual dialog text between the event-service employee and the hotel guest.
S154: generate a hotel service task based at least on the content of the dialog text.
In a preferred embodiment, step S122 comprises establishing a second mapping table between each first sound-pickup device, at least one preset hotel service task, and the device's location.
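The extended second mapping table can be represented as one record per device. The device ids, locations, and task lists below are illustrative assumptions:

```python
# Sketch of the extended second mapping table (S122): each sound-pickup
# device maps to its location and its preset hotel service tasks.
DEVICE_TABLE = {
    "device-22": {"location": "lobby",
                  "tasks": ["place order", "expedite order", "pay bill"]},
    "device-23": {"location": "floor 3 corridor",
                  "tasks": ["housekeeping", "maintenance"]},
}

def tasks_for(device_id):
    """Look up the preset tasks later fed to the task generation model."""
    return DEVICE_TABLE[device_id]["tasks"]
```

Scoping the candidate tasks per device location is what lets the model in S154 rank a small, relevant set instead of every task in the hotel.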
Step S154 comprises:
S1541: input the content of the dialog text and the preset hotel service tasks associated with the first sound-pickup device in the second mapping table into a trained hotel task generation model.
S1542: increase the confidence of each preset hotel service task in the hotel task generation model according to its relevance to the dialog text.
S1543: output the hotel service task with the highest confidence according to the confidence ranking.
S1544: send that hotel service task to the corresponding hotel service department.
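Steps S1541 to S1543 boost each preset task's confidence by its relevance to the dialog and pick the top-ranked one. In place of the trained task generation model, the sketch below scores relevance by keyword overlap; the keyword lists, base confidence, and boost are all illustrative assumptions:

```python
# Sketch of S1541-S1543. A trained model would assign base confidences;
# keyword overlap stands in here for the dialog/task relevance signal.
TASK_KEYWORDS = {  # hypothetical keyword lists per preset task
    "place order": ["menu", "order"],
    "expedite order": ["waiting", "meal", "half an hour"],
    "pay bill": ["bill", "pay"],
}

def rank_tasks(dialog_text, preset_tasks, base_confidence=0.1, boost=0.3):
    """S1542: boost each preset task's confidence by relevance, then
    S1543: return the top task along with the full score table."""
    scores = {}
    for task in preset_tasks:
        hits = sum(kw in dialog_text for kw in TASK_KEYWORDS.get(task, []))
        scores[task] = base_confidence + boost * hits
    return max(scores, key=scores.get), scores

dialog_text = ("Finally someone came, I have been waiting for ages. "
               "My meal has taken half an hour.")
best_task, scores = rank_tasks(
    dialog_text, ["place order", "expedite order", "pay bill"])
```

On the lobby dialog from the worked example, the keyword hits push "expedite order" to the top, matching the outcome the description reaches with its trained model.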
The speech recognition method for hotel scenarios of the invention enables comprehensive, real-time voice monitoring and analysis throughout a hotel, identifies and separates speakers in noisy environments, attributes analysis results to specific persons, greatly reduces labor cost, and greatly improves recognition accuracy and recall.
The improvements of the invention are as follows:
1. A SIM card is built into the hotel employee's smart badge, with IoT-based monitoring of its communication state.
2. Intelligent VAD (voice activity detection) is enabled.
3. Human-voice emotion detection and accurate ASR (automatic speech recognition) transcription are realized.
4. Multimodal emotion analysis and service-point analysis are performed based on deep learning.
The open-scene intelligent recording device provided by the invention integrates several technical highlights. First, it is not limited to specific geographic locations or scenes: with a 4G SIM data card inserted and power from any socket, real-time recording and transmission is possible. Because the recording devices may be distributed across all provinces and cities of the country, their number is huge, so the devices are monitored and maintained through a dedicated IoT information platform.
If every device recorded 24 hours a day, multiplied across the number of devices, the amount of data generated would be enormous. The invention therefore adopts intelligent voice activity detection: recording starts only when a human voice is detected nearby. Background noise from objects, music, animals, and the like does not trigger recording. This reduces network consumption and the cost of the subsequent speech transcription.
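A minimal form of this voice-activity gating is short-time-energy thresholding per audio frame. Real products use trained VAD models with spectral cues; the sample values and threshold below are illustrative:

```python
def is_speech(frame, energy_threshold=0.01):
    """Crude VAD sketch: flag a frame as speech when its mean energy
    exceeds a threshold. A production VAD would also use spectral cues
    to reject music and animal sounds, as the description requires."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > energy_threshold

def gated_frames(frames):
    """Keep only frames that trigger the VAD, reducing upload volume."""
    return [f for f in frames if is_speech(f)]

silence = [0.001] * 160                    # near-silent frame (toy samples)
speech = [0.3, -0.4, 0.25] * 53 + [0.3]    # energetic 160-sample frame
kept = gated_frames([silence, speech, silence])
```

Only the energetic frame survives the gate, which is the mechanism that keeps quiet hours from consuming bandwidth and transcription budget.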
In a preferred embodiment, the device triggers recording when a human voice is detected, and analysis is performed for every 30 seconds of recording. Two types of information are analyzed: audio information and language information. The audio information mainly covers emotion recognition and speech rate of the human voice, where the emotion categories are: mild, normal, agitated, and very agitated, and the speech-rate categories are: very slow, normal, fast, and very fast. The speaker's current emotional state can largely be judged from these two measurements. In addition to emotion recognition, speech recognition transcription is performed on the recordings using common transcription interfaces such as those of iFLYTEK and Baidu. The transcribed text can then be matched one-to-one with the preceding emotion recognition results.
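The two audio measurements can be expressed as simple bucketing functions over a 30-second window. The bucket boundaries (words per second, and a 0-1 arousal score from a hypothetical emotion model) are illustrative assumptions, not values from the patent:

```python
def speech_rate_bucket(words, seconds):
    """Bucket speaking rate into the four categories named above.
    Boundaries in words/second are hypothetical."""
    rate = words / seconds
    if rate < 1.5:
        return "very slow"
    if rate < 3.0:
        return "normal"
    if rate < 4.5:
        return "fast"
    return "very fast"

def emotion_bucket(arousal_score):
    """Map a hypothetical 0-1 arousal score from an emotion model to the
    four emotion categories: mild / normal / agitated / very agitated."""
    for bound, label in [(0.25, "mild"), (0.5, "normal"), (0.75, "agitated")]:
        if arousal_score < bound:
            return label
    return "very agitated"

# One 30-second analysis window: 120 transcribed words, high arousal.
state = (emotion_bucket(0.8), speech_rate_bucket(120, 30))
```

The pair of labels per window is what gets aligned one-to-one with the transcript for downstream quality inspection.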
The last step is service quality inspection analysis of service personnel in a specific scene. The analyzed data source contains the bimodal audio and text information recorded by the device; when building the deep model, the following features are computed:
the fbank features and emotion features of the speech segments;
the embedding semantic features of the transcribed Chinese text of the speech segments; and
the mixed audio-text features of the speech segments.
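In the simplest reading, the mixed audio-text input to the model is a concatenation of the three feature groups listed above. The dimensions below are toy values for illustration (real fbank features are typically 80-dimensional per frame, and text embeddings several hundred dimensions):

```python
def fuse_features(fbank_vec, emotion_vec, text_embedding):
    """Concatenate per-segment audio features (fbank + emotion) with the
    transcript's text embedding into one bimodal model input vector."""
    return list(fbank_vec) + list(emotion_vec) + list(text_embedding)

fbank_vec = [0.2, 0.5, 0.1, 0.9]    # pooled fbank features (toy 4-dim)
emotion_vec = [0.7, 0.1]             # e.g. arousal / valence scores
text_embedding = [0.3, -0.2, 0.8]    # transcript embedding (toy 3-dim)
fused = fuse_features(fbank_vec, emotion_vec, text_embedding)
```

More sophisticated fusion (cross-attention between the modalities) is also consistent with the Transformer model the description mentions; concatenation is only the baseline sketch.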
By fusing these features and training a Transformer neural network model, the speaker's emotion and business-point topic in the current time period can be analyzed in real time, thereby detecting customer satisfaction and opinions regarding service personnel, the environment, associated facilities, and other related service points. This realizes a fully automatic service quality inspection process in open scenes.
This method of service quality inspection from open-scene voice information has broad application value and can improve or replace the traditional customer-service scenario, because the device and system are no longer limited to monitoring specific communications such as telephone and mobile-phone calls, but monitor and record the whole open scene. The advantages of the invention are:
1. It is not limited by time, environment, or personnel, and enables comprehensive monitoring and real-time analysis.
2. In a noisy environment, it can identify and separate speaker identities and attribute analysis results to specific persons, which conventional speech analysis cannot do.
3. Recognition via a trained bimodal deep learning model removes the need to design and maintain keyword matching rules, greatly reducing labor cost while greatly improving recognition accuracy and recall.
As shown in Fig. 2, voiceprint features are collected for each hotel employee 11, 12, 13, 14, and a first mapping table is established between the serial number of each employee's smart badge 10 and the employee's voiceprint features. First sound-pickup devices 21, 22, ..., 2N are deployed at the various hotel locations, each connected to the hotel server 3 and capturing first ambient audio 31; all are plug-powered. The hotel server 3 establishes a second mapping table between each first sound-pickup device, at least one preset hotel service task, and its location; for example, the first sound-pickup device 22 is installed in the hotel lobby, and its preset hotel service tasks include ordering, expediting an order, paying the bill, and the like. The first sound-pickup device 22 extracts the voiceprint features of the first ambient audio 31; when no hotel employee's voiceprint feature is matched, a hotel employee 11 is selected nearby as the event-service employee based on the location where the first ambient audio 31 was captured, and a service instruction is generated for the event-service employee's smart badge 10 (equipped with wireless networking and audio recording), guiding hotel employee 11 to that location.
As shown in figs. 3 and 4, the first radio device 22 continues to extract voiceprint features from the first environment audio 31. When the voiceprint features of the event service employee are hit, a recording instruction 33 is sent to the event service employee's intelligent chest card 10, which then collects a second environment audio 32 (including the conversation between hotel employee 11 and hotel guest 15) and sends it to the hotel server 3. This avoids draining the battery-powered intelligent chest card 10 by keeping it recording at all times: the chest card 10 is triggered only after the first radio device determines from the environment audio that an event has occurred and hotel employee 11 has arrived on the scene.
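The battery-saving trigger described here can be sketched as a tiny state machine; the state names and transition condition are assumptions made for illustration, not terms from the patent.

```python
from enum import Enum

class CardState(Enum):
    IDLE = "idle"              # chest card radio off, preserving battery
    DISPATCHED = "dispatched"  # service instruction sent, employee en route
    RECORDING = "recording"    # employee's voiceprint heard on site, card records

def next_state(state, employee_voiceprint_hit):
    """Advance the chest-card state when the plug-powered first radio device
    reports whether the dispatched employee's voiceprint was heard on site."""
    if state is CardState.DISPATCHED and employee_voiceprint_hit:
        return CardState.RECORDING  # only now is the recording instruction sent
    return state
```

The chest card thus records only from the employee's arrival onward, instead of being always on.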
As shown in figs. 5, 6 and 7, the hotel server 3 extracts the first speech segments 41, 43 and 45 of the event service employee from the second environment audio 32 based on the event service employee's voiceprint features, and performs speech recognition on them to obtain first dialogue segments. The text of the first speech segment 41 is "hello". The text of the first speech segment 43 is "sorry, what can I do for you". The text of the first speech segment 45 is "I will hurry it up for you right away".
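One way to sketch the voiceprint-based extraction of the employee's speech segments: score each fixed-length audio window against the employee's enrolled embedding and split on a threshold. The window representation and the 0.8 threshold are illustrative assumptions, not details from the patent.

```python
import math

def _cos(a, b):
    # cosine similarity between two voiceprint embeddings
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def split_by_voiceprint(windows, employee_embedding, threshold=0.8):
    """Split windowed second-environment audio into the event service
    employee's speech (first speech segments) and everything else
    (second speech segments); each window is (start_seconds, embedding)."""
    first, second = [], []
    for start_sec, emb in windows:
        bucket = first if _cos(emb, employee_embedding) >= threshold else second
        bucket.append(start_sec)
    return first, second
```

The windows collected into `first` would then be passed to speech recognition to produce the first dialogue segments.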
Speech recognition is performed on the remaining second speech segments 42, 44 in the second environment audio 32 to obtain second dialogue segments. The text of the second speech segment 42 is "finally, I have been waiting for ages". The text of the second speech segment 44 is "my meal has taken half an hour already".
Based on the sequential arrangement of the first and second conversation segments in the second environmental audio 32, a contextual conversation text of the event service employee with the hotel guest is generated.
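Generating the context dialogue amounts to interleaving the recognized fragments by their start time in the second environment audio. A minimal sketch follows; the timestamps and fragment texts are illustrative.

```python
def build_dialog(first_segments, second_segments):
    """Interleave employee (first) and guest (second) dialogue fragments by
    start time to produce the context dialogue text."""
    labeled = [(t, "employee", text) for t, text in first_segments]
    labeled += [(t, "guest", text) for t, text in second_segments]
    return [f"{who}: {text}" for _, who, text in sorted(labeled)]

dialog = build_dialog(
    first_segments=[(0.0, "hello"),
                    (9.0, "sorry, what can I do for you"),
                    (20.0, "I will hurry it up for you right away")],
    second_segments=[(4.0, "finally, I have been waiting for ages"),
                     (14.0, "my meal has taken half an hour already")],
)
```

The resulting ordered list alternates speakers exactly as the fragments occurred in the recording.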
Finally, the content of the dialogue text and the preset hotel service tasks corresponding to the first radio device 22 in the second mapping relation table are input into the trained hotel task generation model, which increases the confidence of each preset hotel service task (ordering, order-urging, bill requests and the like) based on its relevance to the dialogue text. According to the confidence ranking, the hotel service task with the highest confidence, "order-urging", is output and sent to the corresponding hotel kitchen, greatly accelerating information transfer inside the hotel and optimizing the hotel guest's experience.
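The confidence step can be sketched with keyword-based relevance as a stand-in for the trained hotel task generation model. The cue words and task names are invented for illustration; the patent's actual model is a learned one, not a keyword matcher.

```python
def rank_tasks(dialog_text, preset_tasks, cue_words):
    """Score each preset hotel service task by how many of its cue words
    appear in the dialogue text, then output the highest-confidence task."""
    scores = {task: sum(dialog_text.count(c) for c in cue_words.get(task, []))
              for task in preset_tasks}
    return max(scores, key=scores.get), scores

task, scores = rank_tasks(
    "my meal has taken half an hour, please hurry it up",
    ["ordering", "order-urging", "bill-request"],
    {"ordering": ["menu", "may I order"],
     "order-urging": ["hurry", "half an hour", "still waiting"],
     "bill-request": ["bill", "check please"]},
)
# the winning task would then be routed to its department, e.g. the kitchen
```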
Figure 8 is a block schematic diagram of the hotel scenario based speech recognition system of the present invention. As shown in fig. 8, an embodiment of the present invention further provides a voice recognition system based on hotel scenes, which is used to implement the above voice recognition method based on hotel scenes, and the voice recognition system based on hotel scenes includes:
and the voiceprint feature collection module 51 is used for establishing a voiceprint feature library of the hotel staff.
The first environment audio module 52 is configured to set a first radio device at each preset location of the hotel, and collect the first environment audio.
And the event service starting module 53, which extracts the voiceprint features of the first environment audio; when the voiceprint features of the hotel staff are not hit, a hotel employee is matched nearby as the event service employee to go to the area where the first radio device is located.
And the second environment audio module 54, when the voiceprint feature of the event service employee is detected in the voiceprint feature of the environment audio, starts a wireless radio module of the intelligent wearable device of the event service employee, collects the second environment audio and sends the second environment audio to the hotel server. And
the event dialog text module 55 performs voice recognition based on the voiceprint feature on the second environmental audio to generate an event dialog text.
In a preferred embodiment, the voiceprint feature collection module 51 is configured to collect voiceprint features for each hotel employee. And establishing a first mapping relation table of the serial number of the intelligent wearable equipment of each hotel employee and the voiceprint characteristics of the hotel employees.
In a preferred embodiment, the first ambient audio module 52 is configured to provide a first radio device throughout the hotel that captures the first ambient audio. And establishing a second mapping relation table of each first radio equipment and the position.
In a preferred embodiment, the event service initiation module 53 is configured to cause the first radio device to extract the voiceprint features of the first environment audio. When no hotel employee's voiceprint features are hit among them, at least one hotel employee is matched nearby as the event service employee based on the position of the first radio device that collected the first environment audio, and a service instruction is generated and sent to the event service employee's intelligent wearable device, guiding the employee to the position of the first radio device that collected the first environment audio.
In a preferred embodiment, the second environment audio module 54 is configured to cause the first radio device to continue extracting the voiceprint features of the first environment audio. When the voiceprint features of the event service employee are hit, a recording instruction is sent to the event service employee's intelligent wearable device, which then collects the second environment audio and sends it to the hotel server.
In a preferred embodiment, the event dialog text module 55 is configured to extract the first speech segments of the event service employee from the second environment audio based on the event service employee's voiceprint features and perform speech recognition to obtain first dialogue segments; to perform speech recognition on the remaining second speech segments in the second environment audio to obtain second dialogue segments; to generate a context dialogue text between the event service employee and the hotel guest based on the sequential arrangement of the first and second dialogue segments in the second environment audio; and to generate a hotel service task based at least on the content of the dialogue text.
In a preferred embodiment, the first environment audio module 52 establishes a second mapping relation table associating each first radio device with at least one preset hotel service task and its position. The event dialog text module 55 then inputs the content of the dialogue text and the preset hotel service tasks corresponding to the first radio device in the second mapping relation table into the trained hotel task generation model, increases the confidence of each preset hotel service task based on its relevance to the dialogue text, outputs the hotel service task with the highest confidence according to the confidence ranking, and sends that task to the corresponding hotel service department.
The voice recognition system based on the hotel scene can perform all-around voice monitoring and real-time analysis of the hotel, can recognize and separate speaker identities in a noisy environment, can attribute analysis results to specific persons, greatly reduces labor cost, and greatly improves recognition accuracy and recall rate.
The embodiment of the invention also provides a voice recognition device based on hotel scenes, which comprises a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the hotel scene-based voice recognition method via execution of the executable instructions.
As shown above, the voice recognition system based on hotel scenes according to the embodiment of the present invention can perform all-around voice monitoring and real-time analysis of the hotel, can recognize and separate speaker identities in a noisy environment, and can attribute analysis results to specific persons, thereby greatly reducing labor cost and greatly improving recognition accuracy and recall rate.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."
Fig. 9 is a schematic structural diagram of a hotel scene-based speech recognition device of the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 9. The electronic device 600 shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including the memory unit 620 and the processing unit 610), a display unit 640, etc.
Wherein the storage unit stores program code executable by the processing unit 610, so that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the hotel scene-based voice recognition method section of this specification. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
The embodiment of the invention also provides a computer-readable storage medium for storing a program; when the program is executed, the steps of the voice recognition method based on the hotel scene are implemented. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code which, when run on a terminal device, causes the terminal device to perform the steps according to various exemplary embodiments of the present invention described in the hotel scene-based voice recognition method section of this specification.
As shown above, the voice recognition system based on hotel scenes of the embodiment of the present invention can perform all-around voice monitoring and real-time analysis of the hotel, can recognize and separate speaker identities in a noisy environment, and can attribute analysis results to specific persons, thereby greatly reducing labor cost and greatly improving recognition accuracy and recall rate.
Fig. 10 is a schematic structural diagram of a computer-readable storage medium of the present invention. Referring to fig. 10, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this respect, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
In summary, the present invention provides a voice recognition method, system, device and storage medium based on hotel scenes, which can perform all-around voice monitoring and real-time analysis of hotels, recognize and separate speaker identities in a noisy environment, and attribute analysis results to specific persons, thereby greatly reducing labor cost and greatly improving recognition accuracy and recall rate.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A voice recognition method based on hotel scenes is characterized by comprising the following steps:
establishing a voiceprint feature library of hotel employees;
respectively arranging first radio equipment at preset positions of hotels to collect first environment audio;
extracting voiceprint features of the first environment audio, and matching the hotel staff nearby as an event service staff to go to the area of the first radio equipment when the voiceprint features of the hotel staff are not hit;
when the voiceprint characteristics of the event service staff are detected in the voiceprint characteristics of the environmental audio, starting a wireless radio module of intelligent wearable equipment of the event service staff, and acquiring a second environmental audio and sending the second environmental audio to a hotel server; and
performing voice recognition based on the voiceprint features on the second environment audio to generate an event dialog text.
2. The hotel scenario-based voice recognition method of claim 1, wherein establishing a voiceprint feature library for hotel employees comprises:
collecting voiceprint characteristics of each hotel employee;
and establishing a first mapping relation table of the serial number of the intelligent wearable device of each hotel employee and the voiceprint characteristics of the hotel employees.
3. The hotel scene-based voice recognition method as claimed in claim 2, wherein the step of respectively arranging a first radio device at each preset hotel position and acquiring a first environmental audio comprises the steps of:
arranging first radio equipment at each position of a hotel, wherein the first radio equipment collects first environment audio; and
and establishing a second mapping relation table of each first radio equipment and the position.
4. The hotel scenario-based voice recognition method of claim 3, wherein said extracting the voiceprint features of the first environment audio, and when the voiceprint features of the hotel employee are not hit in the voiceprint features, matching the hotel employee as an event service employee to go to the area of the first radio device nearby comprises:
the first radio equipment extracts the voiceprint characteristics of the first environment audio;
and when no voiceprint feature of the hotel staff is hit among the voiceprint features, matching at least one hotel employee nearby as the event service employee based on the position of the first radio device that collected the first environment audio, generating a service instruction to the intelligent wearable device of the event service employee, and guiding the employee to the position of the first radio device that collected the first environment audio.
5. The hotel scene-based voice recognition method as claimed in claim 3, wherein the step of starting a wireless radio module of the intelligent wearable device of the event service staff when the voiceprint features of the event service staff are detected in the voiceprint features of the environment audio, and collecting a second environment audio and sending the second environment audio to the hotel server, comprises:
the first radio equipment continues to extract the voiceprint features of the first environment audio;
when the voiceprint features of the event service staff are hit, sending a recording instruction to the intelligent wearable device of the event service staff; and the intelligent wearable device collects the second environment audio and sends the second environment audio to the hotel server.
6. The hotel scenario-based speech recognition method of claim 3, wherein said voiceprint feature-based speech recognition of said second ambient audio to generate an event dialog text comprises:
extracting a first voice fragment of the event service staff from the second environment audio based on the voiceprint characteristics of the event service staff, and performing voice recognition to obtain a first conversation fragment;
performing voice recognition on the remaining second voice segment in the second environment audio to obtain a second dialogue segment;
generating a context dialog text of the event service employee with the hotel guest based on the sequential arrangement of the first and second dialog segments in the second ambient audio; and
and generating a hotel service task at least based on the content of the conversation text.
7. The hotel scenario-based speech recognition method of claim 3, wherein said establishing a second mapping relationship table between each of said first radio devices and a location comprises:
establishing a second mapping relation table of each first radio device, at least one preset hotel service task and the position of the first radio device;
the generating a hotel service task based on at least the content of the dialog text comprises:
inputting the content of the conversation text and a preset hotel service task corresponding to the first radio equipment in the second mapping relation table into a trained hotel task generation model;
based on the relevance between the dialog text and the preset hotel service task, increasing the confidence of the preset hotel service task in the hotel task generation model;
outputting a hotel service task with the highest confidence degree according to the confidence degree sequencing result; and
and sending the hotel service task to a corresponding hotel service department.
8. A hotel scene-based voice recognition system for implementing the hotel scene-based voice recognition method of claim 1, comprising:
the voice print characteristic collection module is used for establishing a voice print characteristic library of the hotel staff;
the hotel environment audio system comprises a first environment audio module, a second environment audio module and a third environment audio module, wherein the first environment audio module is provided with first radio equipment at each preset hotel position and collects first environment audio;
the event service starting module is used for extracting the voiceprint features of the first environment audio, and when the voiceprint features of the hotel staff are not hit, the hotel staff are matched nearby to serve as event service staff to go to the area where the first radio equipment is located;
the second environment audio module is used for starting a wireless radio module of intelligent wearable equipment of the event service staff when the voiceprint characteristics of the event service staff are detected in the voiceprint characteristics of the environment audio, acquiring the second environment audio and sending the second environment audio to the hotel server; and
and the event dialog text module is used for carrying out voice recognition based on the voiceprint characteristics on the second environment audio to generate an event dialog text.
9. A speech recognition device based on hotel scenes, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the hotel scenario-based speech recognition method of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer-readable storage medium storing a program, which when executed by a processor performs the steps of the hotel scenario-based speech recognition method of any one of claims 1 to 7.
CN202210539264.9A 2022-05-18 2022-05-18 Voice recognition method, system, equipment and storage medium based on hotel scene Pending CN114937453A (en)

Publications (1)

Publication Number Publication Date
CN114937453A true CN114937453A (en) 2022-08-23



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination