CN112331207A - Service content monitoring method and device, electronic equipment and storage medium - Google Patents

Service content monitoring method and device, electronic equipment and storage medium

Info

Publication number
CN112331207A
Authority
CN
China
Prior art keywords
service
voice
pronunciation
recognized
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011060127.4A
Other languages
Chinese (zh)
Inventor
廖光朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Original Assignee
Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd filed Critical Audio Digital Huiyuan Shanghai Intelligent Technology Co ltd
Priority to CN202011060127.4A
Publication of CN112331207A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a service content monitoring method and device, an electronic device and a storage medium. The method comprises the following steps: acquiring a service expression set and a voice to be recognized, the service expression set comprising a plurality of service expressions; recognizing the voice to be recognized based on a pre-trained speech recognition model to obtain an index network; performing character matching on the index network and the service expression set to determine the service expression matched with the voice to be recognized; extracting a target keyword from the service expression matched with the voice to be recognized; and determining the service content according to the target keyword. By adopting the method, the service content is monitored on the basis of voice rather than video, which effectively saves the storage space consumed by video-based monitoring.

Description

Service content monitoring method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of home-based care technologies, and in particular, to a method and an apparatus for monitoring service content, an electronic device, and a storage medium.
Background
With the aging of the population, home-based care services for the elderly have emerged. A home care service is a service of a certain duration provided to the elderly at home by professionally trained service personnel.
When a service person provides home care service for an elderly person at home, the care service manager first needs to confirm whether the service person has provided the agreed service content. At present, video monitoring equipment is mainly used to continuously monitor the behavior of service personnel, and the service content is determined from the monitoring video. However, because surveillance video occupies a large amount of storage space, determining the service content from surveillance video requires a large storage space.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a service content monitoring method, an apparatus, an electronic device and a storage medium capable of saving storage space.
A service content monitoring method, the method comprising:
acquiring a service expression set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
determining the service expression matched with the speech to be recognized by performing character matching on the index network and the service expression set;
extracting a target keyword from the service expression matched with the voice to be recognized;
and determining service content according to the target keywords.
In one embodiment, the training step of the speech recognition model includes:
acquiring a sample text and a pronunciation dictionary corresponding to the sample voice; the sample text comprises at least one word to be labeled;
carrying out pronunciation labeling on the participles to be labeled according to the pronunciation dictionary to obtain a label sequence;
training a speech recognition model based on the sample speech and a corresponding tag sequence.
In one embodiment, the pronunciation dictionary comprises pronunciation participles and corresponding pronunciation labels; the pronunciation labeling of the participle to be labeled according to the pronunciation dictionary comprises the following steps:
performing word segmentation matching on the word segmentation to be labeled and the pronunciation dictionary, and judging whether pronunciation word segmentation matched with the word segmentation to be labeled exists in the pronunciation dictionary or not based on a word segmentation matching result;
when the pronunciation dictionary has pronunciation participles matched with the participles to be labeled, labeling the participles to be labeled according to pronunciation labels corresponding to the matched pronunciation participles;
when the pronunciation dictionary does not have pronunciation participles matched with the participles to be labeled, segmenting the participles to be labeled based on a preset rule to obtain participle segments;
and taking the word segmentation segment as a word to be labeled, and returning to the step of performing word segmentation matching on the word to be labeled and the pronunciation dictionary until the pronunciation word matched with the word to be labeled exists in the pronunciation dictionary.
In one embodiment, the speech recognition model comprises a speech separation enhancement model and a target recognition model; the training step of the speech recognition model comprises the following steps:
acquiring a first loss function of a voice separation enhancement model and a second loss function of a target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model to obtain a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function;
and performing joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and finishing the training when a preset convergence condition is met.
In one embodiment, the recognizing the speech to be recognized based on the pre-trained speech recognition model to obtain the index network includes:
extracting the voice characteristics of the voice to be recognized, and determining the pinyin of each participle in the voice to be recognized based on the voice characteristics; the pinyin is composed of one or more sound units;
determining a mapping relation between a sound unit in the pinyin and a corresponding fuzzy sound;
determining a candidate character sequence according to a pronunciation dictionary and the mapping relation;
and generating an index network based on the candidate character sequence.
In one embodiment, the determining the service expression matching the speech to be recognized by character matching the index network and the service expression set includes:
determining service expression matched with each candidate character sequence in the index network by performing character matching on the index network and the service expression set;
calculating the offset distance of each candidate word sequence relative to the matched service expression;
screening out a target character sequence from the candidate character sequences based on the offset distance;
and judging the service expression matched with the target character sequence as the service expression matched with the voice to be recognized.
In one embodiment, the method further comprises:
determining all extracted target keywords;
determining the generation time of each target keyword;
generating a care report based on the generation time and the target keyword.
A service content monitoring apparatus, the apparatus comprising:
the index network generation module is used for acquiring the service expression set and the voice to be recognized; the service expression set comprises a plurality of service expressions; recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
the target keyword extraction module is used for performing character matching on the index network and the service expression set to determine service expressions matched with the speech to be recognized; and extracting a target keyword from the service expression matched with the voice to be recognized.
And the service content determining module is used for determining service content according to the target keyword.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a service expression set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
determining the service expression matched with the speech to be recognized by performing character matching on the index network and the service expression set;
extracting a target keyword from the service expression matched with the voice to be recognized;
and determining service content according to the target keywords.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
in the home-based care service process, a service expression set and a voice to be recognized are obtained; the service expression set comprises a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
determining the service expression matched with the speech to be recognized by performing character matching on the index network and the service expression set;
extracting a target keyword from the service expression matched with the voice to be recognized;
and determining the service content of the home care service according to the target keyword.
According to the service content monitoring method, the service content monitoring device, the electronic equipment and the storage medium, the service expression set and the speech to be recognized are obtained, and the speech to be recognized can be subjected to speech recognition based on the pre-trained speech recognition model, so that an index network containing a plurality of candidate recognition results is obtained; by performing character matching on the index network and the service expression set, the service expression which can most represent the voice to be recognized can be screened out from the service expression set, so that the target keyword extracted based on the service expression which can most represent the voice to be recognized is more accurate; by extracting the target keywords, the service content can be determined according to the meaning of the target keywords, so that the service content of service personnel can be effectively monitored. Because the service content is monitored based on the voice with smaller storage space, compared with the traditional content monitoring based on video, the method and the device can effectively save the storage space consumed when the service content is monitored.
Drawings
FIG. 1 is a diagram of an exemplary embodiment of a method for monitoring service content;
FIG. 2 is a flow diagram illustrating a method for monitoring service content in one embodiment;
FIG. 3 is a schematic diagram of an indexing network in one embodiment;
FIG. 4 is a flowchart illustrating a method for training a speech recognition model according to one embodiment;
FIG. 5 is a block diagram showing the construction of a service content monitoring apparatus according to an embodiment;
FIG. 6 is a diagram illustrating an internal structure of an electronic device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The service content monitoring method provided by the application can be applied to the application environment shown in fig. 1, in which the microphone box 102 communicates with the host box 104 over a network. While a service person provides home care service for an elderly person at home, the microphone box 102 is worn on the service person's body, captures the service person's voice, and recognizes the voice to obtain a target keyword; when the target keyword is consistent with the service item currently being executed by the service person, it is determined that the service person has executed that service item. The microphone box 102 sends the target keyword to the host box 104, and the host box 104 stores the target keyword.
In one embodiment, as shown in fig. 2, a service content monitoring method is provided, which is described by taking the example that the method is applied to the microphone box in fig. 1, and includes the following steps:
s202, acquiring a service expression set and a voice to be recognized.
The service expression set is a set including at least one service expression. A service expression is a standard, respectful and friendly expression used by service personnel in language communication with the served object during the home-based care service; for example, a service expression may be "Is the water temperature suitable?" or "A hair washing service will now be provided for you". The voice to be recognized refers to the audio information acquired by the microphone box in real time.
Specifically, when it is determined that service personnel starts to provide home care service for the old at home, the microphone box acquires a service expression set, and acquires audio information in real time to obtain to-be-recognized voice.
In one embodiment, the host box has pre-stored therein a correspondence between service items and service expression subsets, with different service expression subsets corresponding to different service items. The service items refer to service contents which should be provided by service personnel in the whole home-based care service process, for example, the service items can be hair washing, massage and the like. It is readily understood that throughout the home care service, a service person may provide a variety of different service items for the serviced object. The service term subset refers to the canonical terms associated with the service item that should be used by the service person in executing the service item.
Before providing home-based care service for a served object, the service person can agree with the served object on the service items to be provided and generate order data based on the agreed care service items, so that the microphone box can acquire the corresponding service expression subsets based on the care service items in the order data and generate the service expression set from the acquired subsets.
In one embodiment, the order data includes service items and service times for each service item. The host box acquires order data and acquires a corresponding service expression set based on service items in the order data. The host box determines the service content to be carried out by the service personnel at the current moment according to the service time in the order data, and correspondingly displays the service expression associated with the service content in the screen of the host box, so that the service personnel can carry out language communication with the served object by using the standard expression according to the screen prompt information.
And S204, recognizing the voice to be recognized based on the pre-trained voice recognition model to obtain an index network.
The voice recognition is to convert the input voice signal into a text corresponding to the input voice signal. The speech recognition model refers to a machine learning model with speech feature extraction capability. The voice feature is data for reflecting an audio feature. The voice features can be one or more of tone, pronunciation, frequency spectrum and the like.
Specifically, a voice recognition model is preset in the microphone box. The speech recognition model includes an endpoint detection submodel, an acoustic submodel, and a language submodel. The endpoint detection submodel is used for separating voice signals and non-voice signals. And the endpoint detection sub-model performs framing processing on the voice to be recognized, extracts the characteristic parameters of the voice frame and determines a voice segment and a non-voice segment based on the characteristic parameters of the voice frame. More specifically, the speech segments and the non-speech segments may be determined based on short-time energy and zero-crossing rate, information entropy, short-time energy frequency values, template matching, and the like.
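For illustration only, the following sketch shows the general idea of short-time-energy and zero-crossing-rate based endpoint detection; the frame length, hop size, thresholds and function names are assumptions made for this sketch and are not values taken from the patent.

```python
import numpy as np

def detect_speech_frames(signal, frame_len=400, hop=160,
                         energy_thresh=0.01, zcr_thresh=0.3):
    """Mark each frame as speech (True) or non-speech (False).

    A frame is treated as speech when its short-time energy is high enough
    and its zero-crossing rate is low enough; the thresholds here are
    illustrative, not values from the patent.
    """
    flags = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)                         # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # zero-crossing rate
        flags.append(energy > energy_thresh and zcr < zcr_thresh)
    return flags

# Example: 1 second of silence followed by a synthetic "voiced" tone.
sr = 16000
silence = np.zeros(sr)
tone = 0.5 * np.sin(2 * np.pi * 200 * np.arange(sr) / sr)
print(sum(detect_speech_frames(np.concatenate([silence, tone]))))
```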
The acoustic submodel describes the relationship between speech features and speech modeling units and is an important part of a speech recognition system. A conventional speech recognition model generally adopts a GMM-HMM (Gaussian Mixture Model - Hidden Markov Model) acoustic model, in which the GMM models the distribution of the acoustic features of speech and the HMM models the temporal structure of the speech signal. However, the GMM is essentially a shallow network model with limited capability to describe the distribution of acoustic feature states, so when the training speech data is huge the recognition accuracy is low. In this application, acoustic modeling is performed with a CNN-HMM (Convolutional Neural Network - Hidden Markov Model). The CNN is a deep model that can adaptively fit the distribution of arbitrary data by adjusting its own parameters, so a higher recognition accuracy can be achieved.
After the voice segments are obtained, the acoustic submodel extracts the features of the voice segments and performs recognition based on the extracted feature information to obtain the pinyin sequence corresponding to the voice to be recognized. For example, when the speech to be recognized is "wash a head, okay", the pinyin sequence obtained by the acoustic submodel is "xi ge tou hao ma".
The language submodel is used to predict the occurrence probability of the candidate character sequences corresponding to the pinyin sequence and to generate an index network based on these probabilities. Because of homophones, when a pinyin sequence is obtained the language submodel determines the preceding N-1 characters from the pinyin sequence and predicts the probability of the next character based on those N-1 characters, thereby obtaining one or more candidate character sequences corresponding to the pinyin sequence, and an index network is generated from the obtained candidate character sequences. For example, when the pinyin sequence is "xi ge tou hao ma", the character corresponding to "xi" may be "wash" or "west", the characters predicted for "ge tou" after either candidate may be "a head", and the characters predicted for "hao ma" after both "wash a head" and "west a head" are "okay"; the generated index network is as shown in fig. 3. FIG. 3 is a diagram of an index network in one embodiment. A candidate character sequence is the character sequence formed by the nodes and edges along a path from the start node to the end node; for example, "wash a head, okay" is a candidate character sequence.
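The index network can be pictured as a small lattice whose paths from the start node to the end node enumerate the candidate character sequences. The following sketch is an illustrative data structure only (the patent does not prescribe this representation), using the English glosses of the "xi ge tou hao ma" example:

```python
from itertools import product

# Candidate characters per pinyin position (hypothetical glosses of the
# "xi ge tou hao ma" example; the real system works on Chinese characters).
candidates_per_syllable = [
    ["wash", "west"],   # xi
    ["a head"],         # ge tou
    ["okay"],           # hao ma
]

def candidate_sequences(lattice):
    """Enumerate every path through the lattice, i.e. every candidate
    character sequence in the index network."""
    return [" ".join(path) for path in product(*lattice)]

print(candidate_sequences(candidates_per_syllable))
# ['wash a head okay', 'west a head okay']
```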
And S206, performing character matching on the index network and the service expression set to determine the service expression matched with the speech to be recognized.
Specifically, the microphone box performs character matching between the index network and the service expression set, determines the service expression matched with each candidate word sequence in the index network, and calculates the offset distance of each candidate word sequence relative to its matched service expression. The offset distance is the ratio of the number of characters of the candidate word sequence that are not present in the matched service expression to the number of characters that are present in it; tag symbols such as <s> and </s> are not counted. For example, if one character of a candidate word sequence is absent from the matched service expression and four characters are present in it, the offset distance is 1/4. The microphone box takes the candidate character sequence with the smallest offset distance as the target character sequence, and determines the service expression matched with the target character sequence as the service expression matched with the voice to be recognized.
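A minimal sketch of the offset-distance computation described above; character-level counting is used as stated, while the tag-stripping details and the fallback for an empty overlap are assumptions:

```python
def offset_distance(candidate, expression):
    """Ratio of candidate characters absent from the matched expression
    to candidate characters present in it; tag symbols are ignored."""
    expr_chars = set(expression.replace("<s>", "").replace("</s>", ""))
    absent = sum(1 for ch in candidate if ch not in expr_chars)
    present = sum(1 for ch in candidate if ch in expr_chars)
    return absent / present if present else float("inf")

def pick_target_sequence(candidates, matched_expressions):
    """Return the candidate word sequence with the smallest offset distance;
    matched_expressions maps each candidate to its matched service expression."""
    return min(candidates,
               key=lambda c: offset_distance(c, matched_expressions[c]))
```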
In one embodiment, the microphone box analyzes the order data to obtain all service items to be performed by the current home care service, and determines a specific service time period of each service item. When an index network corresponding to the voice to be recognized is generated, the microphone box determines the acquisition time for acquiring the voice to be recognized, and determines the service items currently executed by the service personnel based on the acquisition time and the specific service time periods of the service items. The microphone box screens out candidate service expression associated with the service item currently being executed from the service expression set, and performs character matching on the candidate word sequences and the candidate service expression to obtain service expression matched with each candidate word sequence. By determining the acquisition time for acquiring the voice to be recognized and the specific service time period of each service item, the candidate service term associated with the currently executed service item can be screened out from the service term set, so that the microphone box only needs to perform character matching on the screened candidate service term, and does not need to perform character matching on the whole service term set, thereby greatly improving the matching efficiency.
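The time-based narrowing of the matching scope could look like the following sketch; the order-data structure, field names and time format are assumptions made for illustration:

```python
from datetime import datetime

# Hypothetical order data: each service item with its scheduled time window
# and its associated subset of service expressions.
order_items = [
    {"item": "hair washing", "start": "09:00", "end": "09:30",
     "expressions": ["start <s>hair washing</s>, is that okay"]},
    {"item": "massage", "start": "09:30", "end": "10:00",
     "expressions": ["the <s>massage</s> will begin now"]},
]

def candidate_expressions(acquisition_time, items=order_items):
    """Return only the service expressions of the item scheduled at the
    time the voice to be recognized was captured."""
    t = datetime.strptime(acquisition_time, "%H:%M").time()
    for entry in items:
        start = datetime.strptime(entry["start"], "%H:%M").time()
        end = datetime.strptime(entry["end"], "%H:%M").time()
        if start <= t < end:
            return entry["expressions"]
    return []

print(candidate_expressions("09:15"))  # only the hair-washing expressions
```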
And S208, extracting a target keyword from the service expression matched with the voice to be recognized.
And S210, determining service content according to the target keywords.
The target keywords refer to keywords capable of representing service items, for example, when the service content is hair washing and massage, the corresponding target keywords may be "hair washing" and "massage".
Specifically, the service manager labels the target keyword in each service expression in advance, so that the target keyword can be extracted from the service expression matched with the voice to be recognized based on the labeling result. For example, the target keyword "hair washing" can be labeled in advance with the tags <s> and </s> to obtain the service expression "start <s>hair washing</s>, is that okay"; the microphone box then only needs to identify <s> and </s> to extract the target keyword from the matched service expression, and the service content of the care service is determined based on the target keyword.
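Extracting the target keyword from a labeled service expression then reduces to locating the <s> and </s> tags, as in this sketch (the use of a regular expression is an assumption; the tags follow the example above):

```python
import re

def extract_target_keywords(expression):
    """Return all keywords labeled with <s>...</s> in a service expression."""
    return re.findall(r"<s>(.*?)</s>", expression)

print(extract_target_keywords("start <s>hair washing</s>, is that okay"))
# ['hair washing']
```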
In one embodiment, when the specific service time period of each care service item has been determined based on the order data and the acquisition time of the voice to be recognized has been determined, the microphone box judges, based on the acquisition time and the specific service time periods, which service item the service person should currently be performing, compares that service item with the target keyword, and, when the two are consistent, can conclude that the service person is providing the corresponding home-based care service according to the order data.
In one embodiment, when the target keywords in the speech to be recognized are determined, the microphone box sends the target keywords to the host box, and the host box stores the target keywords correspondingly. And then, the microphone box correspondingly deletes the voice to be recognized so as to protect the privacy of the service personnel and the serviced object.
In the service content monitoring method in the home-based care service process, the service expression set and the speech to be recognized are obtained, and the speech to be recognized can be subjected to speech recognition based on the pre-trained speech recognition model to obtain an index network containing a plurality of candidate recognition results; by carrying out character matching on the index network and the service expression set, the service expression which can most represent the voice to be recognized can be screened out from the service expression set, so that the service content of service personnel is effectively monitored. Because the service content is monitored based on the voice with smaller storage space, compared with the traditional content monitoring based on video, the method and the device can effectively save the storage space consumed when the service content is monitored.
In one embodiment, the training of the speech recognition model comprises: acquiring a sample text and a pronunciation dictionary corresponding to the sample voice; the sample text comprises at least one word to be labeled; carrying out pronunciation labeling on the word to be labeled according to a pronunciation dictionary to obtain a label sequence; a speech recognition model is trained based on the sample speech and the corresponding tag sequence.
The sample speech refers to the speech data used for training the speech recognition model. The sample text refers to the text data obtained after speech recognition is performed on the sample speech. The sample text includes positive samples and negative samples: a positive sample is text data containing a target keyword, and a negative sample is text data containing no target keyword. The pronunciation dictionary is a dictionary used to determine, for each word, the mapping between the word and its initials, finals and tones. The pronunciation dictionary contains the pronunciations of all the characters and words in the sample data.
Specifically, model trainers obtain as many sample voices as possible and manually transcribe the sample voices to obtain the corresponding sample texts, and then input the sample texts and the pronunciation dictionary corresponding to the sample voices into the speech recognition model. The speech recognition model performs word segmentation on the sample text to obtain a plurality of words to be labeled, queries the pronunciation dictionary for the pronunciation label corresponding to each word to be labeled, and labels the word based on that pronunciation label. The microphone box combines the pronunciation labels corresponding to the words to be labeled to obtain a label sequence. For example, the labeling format of each word in the pronunciation dictionary is initial + final + tone, where 1-4 correspond to the four tones and 5 denotes the neutral tone, so the pronunciation label of a word to be labeled may take a form such as "a1 j iu3".
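A small sketch of the dictionary-based labeling step; the dictionary entries and the initial + final + tone strings below are invented examples written in the format the text describes:

```python
# Hypothetical pronunciation dictionary: word -> "initial final tone" label,
# tones 1-4 plus 5 for the neutral tone.
pronunciation_dict = {
    "hair washing": "x i3 f a4",
    "massage": "an4 m o2",
}

def label_sequence(words, dictionary=pronunciation_dict):
    """Concatenate the pronunciation labels of the segmented words to form
    the tag sequence used for training."""
    return [dictionary[w] for w in words if w in dictionary]

print(label_sequence(["hair washing", "massage"]))
```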
Further, the voice recognition model conducts model training on the acoustic submodel and the language submodel based on the sample voice and the corresponding label sequence until the trained model parameters meet preset requirements.
In one embodiment, the sample speech may be subjected to speech recognition to obtain the corresponding sample text, and the sample text may then be segmented into words. Since the recognition accuracy for long keywords is lower than that for short keywords, a long keyword may be split into short keywords to improve keyword recognition accuracy; for example, "hair washing service" may be split into "hair washing / service", where "/" is a word-segmentation symbol.
In this embodiment, pronunciation labeling of the words to be labeled is performed automatically through the pronunciation dictionary. Compared with traditional manual pronunciation labeling, the present application not only improves labeling efficiency but also saves the human resources that manual pronunciation labeling would consume.
In one embodiment, the pronunciation labeling of the words to be labeled according to the pronunciation dictionary comprises the following steps: matching the word to be labeled against the pronunciation dictionary, and judging, based on the matching result, whether a pronunciation word matching the word to be labeled exists in the pronunciation dictionary; when a matching pronunciation word exists in the pronunciation dictionary, labeling the word to be labeled according to the pronunciation label corresponding to the matched pronunciation word; when no matching pronunciation word exists in the pronunciation dictionary, splitting the word to be labeled based on a preset rule to obtain word segments; and taking each word segment as a word to be labeled and returning to the step of matching the word to be labeled against the pronunciation dictionary, until a pronunciation word matching the word to be labeled exists in the pronunciation dictionary.
The pronunciation dictionary comprises pronunciation words and their corresponding pronunciation labels. A pronunciation word is a single word or character, and a pronunciation label is the label information obtained by labeling the pronunciation word in the format initial + final + tone.
Specifically, the microphone box matches the word to be labeled against each pronunciation word in the pronunciation dictionary and judges from the matching result whether a pronunciation word matching the word to be labeled exists in the dictionary. When a pronunciation word identical to the word to be labeled exists in the dictionary, the microphone box takes the pronunciation label of that word as the labeling result of the word to be labeled. When no identical pronunciation word exists, the microphone box splits the word to be labeled based on a preset rule to obtain word segments. For example, the preset rule may split the word at its middle character, so that when the word to be labeled is the place name "lychee yuan community", it may be split into "lychee yuan" and "community".
Further, the microphone box takes each word segment as a word to be labeled and returns to the step of matching the word to be labeled against the pronunciation dictionary, until the dictionary contains a pronunciation word matching the word to be labeled. For example, when the dictionary contains no pronunciation word matching "lychee yuan", the microphone box further splits "lychee yuan" into "lychee" and "yuan", and labels "lychee" and "yuan" respectively based on the pronunciation dictionary.
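The recursive splitting of an unmatched word can be sketched as follows; the middle-character split rule follows the example above, while the toy dictionary and the "<unk>" fallback are assumptions:

```python
def label_word(word, dictionary):
    """Label a word directly if the dictionary knows it; otherwise split it
    at its middle character and label the segments recursively."""
    if word in dictionary:
        return [dictionary[word]]
    if len(word) <= 1:
        return ["<unk>"]                 # unknown single character
    mid = len(word) // 2
    return label_word(word[:mid], dictionary) + label_word(word[mid:], dictionary)

# Hypothetical toy dictionary: it knows "AB" and the single characters "C"
# and "D", but not the full compound "ABCD".
tiny_dict = {"AB": "label_AB", "C": "label_C", "D": "label_D"}
print(label_word("ABCD", tiny_dict))     # ['label_AB', 'label_C', 'label_D']
```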
In this embodiment, the word to be labeled can be split repeatedly and matched against the pronunciation dictionary, so that even when the word to be labeled is a rare word, pronunciation labeling can still be carried out based on the pronunciation dictionary.
In one embodiment, as shown in FIG. 4, the training step of the speech recognition model includes:
s402, a first loss function of the voice separation enhancement model and a second loss function of the target recognition model are obtained.
S404, performing back propagation based on a second loss function to train an intermediate model bridged between the voice separation enhancement model and the target recognition model to obtain a robust representation model.
And S406, fusing the first loss function and the second loss function to obtain a target loss function.
S408, performing combined training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and finishing the training when a preset convergence condition is met.
The voice recognition model comprises a voice separation enhancement model and a target recognition model; the target recognition model includes an acoustic submodel and a language submodel. The speech separation enhancement model is a model with speech separation and/or enhancement capability after training, and specifically may be a model obtained by performing learning training with sample speech as training data and separating target speech from background interference in the sample speech. It is understood that the Voice separation enhancement model may also have the capability of performing Voice Activity Detection (VAD), echo cancellation, reverberation cancellation, or sound source localization preprocessing on the Voice signal, which is not limited herein. The target recognition model is an acoustic model with speech recognition capability after training, and specifically may be a model for performing phoneme recognition on sample speech, which is obtained by performing learning training with sample speech and a tag sequence as training data. The speech separation enhancement model and the target recognition model, respectively, may be pre-trained. The pre-trained speech separation enhancement model and the speech recognition model each have a fixed model structure and model parameters.
Specifically, in order to further improve the recognition accuracy of the speech model, a speech separation enhancement model may be added to the speech model, and the speech model may be further trained based on the speech separation enhancement model. When the combined model training is needed, the microphone box obtains a pre-trained voice separation enhancement model and a target recognition model, a first loss function adopted when the voice separation enhancement model is pre-trained, and a second loss function adopted when the target recognition model is pre-trained. The loss function (loss function) is usually associated with the optimization problem as a learning criterion, i.e. the model is solved and evaluated by minimizing the loss function. The first loss function adopted by the pre-training speech separation enhancement model and the second loss function adopted by the pre-training speech recognition model respectively can be mean square error, mean absolute value error, Log-Cosh loss, quantile loss, ideal quantile loss and the like.
The traditional method mainly splits a speech processing task into two completely independent subtasks: a speech separation task and a target recognition task. In the training stage, the speech separation enhancement model and the target recognition model are therefore trained separately as modules, and in the production and testing stage the enhanced speech output by the speech separation enhancement model is input into the target recognition model for recognition. It is easy to see that this approach does not solve the differentiation problem between the two characterization categories well. In practical application scenarios such as home care service, the voice to be recognized is commonly affected by background music or interference from multiple speakers. Relatively serious distortion is therefore introduced when the speech separation enhancement model performs front-end speech processing, and because this distortion is not considered in the training stage of the target recognition model, directly cascading the independent front-end speech separation enhancement model with the independent back-end target recognition model seriously degrades the final speech recognition performance.
To overcome the difference between the two characterization categories, embodiments of the present application bridge the intermediate model to be trained between the speech separation enhancement model and the target recognition model. The trained intermediate model may be referred to as a robust representation model. More specifically, the microphone box determines a local descent gradient generated by the second loss function in each iteration process according to a preset deep learning optimization algorithm. And the microphone box reversely transmits the local descending gradient to the intermediate model so as to update the model parameters corresponding to the intermediate model until the training is finished when the preset training stopping condition is met.
The microphone box performs a preset logical operation on the first loss function and the second loss function to obtain the target loss function. Taking weighted summation as an example, assume the weighting factor is λ_SS; the target loss function is then L = L2 + λ_SS · L1. The weighting factor may be an empirically or experimentally set value, such as 0.1. It is easy to see that the importance of the speech separation enhancement model in the multi-model joint training can be adjusted by adjusting the weighting factor. The microphone box determines the global descent gradient generated by the target loss function according to a preset deep learning optimization algorithm. The deep learning optimization algorithm used to determine the local descent gradient may be the same as or different from the one used to determine the global descent gradient. The global descent gradient generated by the target loss function is back-propagated in turn from the target recognition model through each network layer of the robust representation model and the speech separation enhancement model, and the model parameters of the speech separation enhancement model, the robust representation model and the target recognition model are iteratively updated in this process until a preset training stop condition is met, at which point training ends.
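A minimal sketch of the weighted loss fusion and joint update, written with PyTorch-style placeholder models as an assumption; the patent does not mandate any particular framework or model structure:

```python
import torch
import torch.nn as nn

# Placeholder models standing in for the speech separation enhancement model,
# the robust representation model and the target recognition model.
separation = nn.Linear(80, 80)
robust_repr = nn.Linear(80, 80)
recognizer = nn.Linear(80, 40)

lambda_ss = 0.1                      # weighting factor for the first loss
params = list(separation.parameters()) + list(robust_repr.parameters()) + \
         list(recognizer.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

noisy = torch.randn(8, 80)           # dummy noisy features
clean = torch.randn(8, 80)           # dummy clean targets (for L1)
labels = torch.randint(0, 40, (8,))  # dummy phoneme labels (for L2)

enhanced = separation(noisy)
logits = recognizer(robust_repr(enhanced))
loss1 = nn.functional.mse_loss(enhanced, clean)       # separation loss L1
loss2 = nn.functional.cross_entropy(logits, labels)   # recognition loss L2
target_loss = loss2 + lambda_ss * loss1               # L = L2 + λ_SS · L1

optimizer.zero_grad()
target_loss.backward()   # global gradient flows back through all three models
optimizer.step()
```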
In this embodiment, the intermediate model is trained by back-propagating the second loss function of the back-end target recognition model, and since the speech separation enhancement model and the target recognition model may be pre-trained, convergence can be achieved after a small number of training iterations. In addition, the end-to-end network model is jointly trained based on the combination of the loss functions of the front-end and back-end models, so that each individual model in the network architecture can fully learn the interference characteristics of speech signals in complex acoustic environments, which guarantees the performance of the global speech processing task and improves speech recognition accuracy.
In one embodiment, recognizing the speech to be recognized based on the pre-trained speech recognition model to obtain the index network includes: extracting the voice features of the speech to be recognized, and determining the pinyin of each word in the speech to be recognized based on the voice features; the pinyin is composed of one or more sound units; determining the mapping relation between the sound units in the pinyin and the corresponding fuzzy sounds; determining candidate character sequences according to the pronunciation dictionary and the mapping relation; and generating an index network based on the candidate character sequences.
A fuzzy sound of a pinyin is a sound unit whose pronunciation is close to that of the pinyin. Fuzzy sounds can arise because the same word is pronounced differently in different dialects. A sound unit is an initial or a final that makes up the pinyin.
Specifically, when the acoustic submodel obtains the to-be-recognized speech output by the speech separation enhancement model, the speech features in the to-be-recognized speech may be extracted based on a preset convolution kernel. For example, pronunciation features in the speech to be recognized are extracted. Meanwhile, the acoustic submodel inputs the voice characteristics into the language submodel, and the language submodel determines the pinyin corresponding to each participle in the voice to be recognized according to the voice characteristics. The language sub-model obtains a fuzzy sound table, and the fuzzy sound table is used for inquiring all sound units in the pinyin of each participle to obtain sound units with fuzzy sound, so that the mapping relation between the sound units with fuzzy sound and the fuzzy sound is established. For example, when the sound unit is "g", the fuzzy sound determined based on the fuzzy sound table is "j".
Further, the language submodel combines the sound unit and the fuzzy sound based on the mapping relation to obtain one or more candidate pinyins corresponding to each participle. For example, when the sound units are "g" and "ai", and the fuzzy sounds determined based on the fuzzy sound table are "j" and "ei", the candidate pinyins obtained by combining are "gai", "jei", "gei" and "jai". The language submodel inquires pronunciation participles corresponding to the candidate pinyin in a pronunciation dictionary, generates a candidate character sequence based on the pronunciation participles corresponding to each participle in the voice to be recognized, and generates an index network according to the candidate character sequence.
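The fuzzy-sound expansion can be sketched as a product over the sound units; the fuzzy-sound table entries below are common confusions used as assumptions:

```python
from itertools import product

# Hypothetical fuzzy-sound table: sound unit -> similarly pronounced unit.
fuzzy_table = {"g": "j", "ai": "ei", "n": "l", "sh": "s"}

def candidate_pinyins(sound_units):
    """Combine each sound unit with its fuzzy sound (if any) to enumerate
    all candidate pinyins for one word."""
    options = [[u, fuzzy_table[u]] if u in fuzzy_table else [u]
               for u in sound_units]
    return ["".join(combo) for combo in product(*options)]

print(candidate_pinyins(["g", "ai"]))
# ['gai', 'gei', 'jai', 'jei']
```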
In this embodiment, under the influence of dialects the speech recognition model may produce different recognition results for the same word. With the method provided by this embodiment, a plurality of candidate recognition results can be obtained and the target recognition result can then be determined among them, so the influence of dialects on the recognition result can be effectively overcome.
In one embodiment, determining the service expression matching the speech to be recognized by character matching the index network and the service expression set comprises: performing character matching on the index network and the service expression set to determine service expressions matched with each candidate character sequence in the index network; calculating the offset distance of each candidate word sequence relative to the matched service expression; screening out a target character sequence from the candidate character sequences based on the offset distance; and judging the service expression matched with the target character sequence as the service expression matched with the voice to be recognized.
Specifically, the microphone box traverses each candidate word sequence in the index network and performs character matching between the candidate word sequence and each service expression in the service expression set, until the service expression matched with every candidate word sequence in the index network has been determined. More specifically, the microphone box takes the candidate word sequence of the current traversal order and selects the service expression having the largest number of repeated characters as the service expression matching that candidate word sequence. For example, when the candidate word sequence of the current traversal order is "wash a head, okay" and the service expression set contains "start washing a head, okay", the service expression sharing the largest number of characters with "wash a head, okay" is "start washing a head, okay".
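A compact sketch of the repeated-character matching step described above; counting distinct shared characters and stripping the tag symbols are assumptions standing in for the exact counting rule, which the text does not spell out:

```python
def shared_characters(candidate, expression):
    """Number of distinct characters the candidate word sequence shares
    with a service expression (tag symbols stripped)."""
    expr = expression.replace("<s>", "").replace("</s>", "")
    return len(set(candidate) & set(expr))

def best_match(candidate, expression_set):
    """Service expression with the largest number of repeated characters."""
    return max(expression_set, key=lambda e: shared_characters(candidate, e))

print(best_match("wash a head okay",
                 ["start <s>washing</s> a head okay",
                  "the <s>massage</s> will begin now"]))
```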
Further, the microphone box calculates the offset distance of each candidate character sequence relative to the matched service expression, takes the candidate character sequence with the minimum offset distance as a target character sequence, and judges the service expression matched with the target character sequence as the service expression matched with the voice to be recognized.
In this embodiment, since the service expression with the smallest offset distance is determined as the service expression corresponding to the speech to be recognized, the service expression screened out based on the offset distance is the language word that can represent the speech to be recognized most, and thus the target keyword determined based on the language word that can represent the speech to be recognized most is more accurate.
In one embodiment, the service content monitoring method further includes: determining all extracted target keywords; determining the generation time of each target keyword; a care report is generated based on the generation time and the target keywords.
Specifically, when it is determined that the home-based care service has been completed, the microphone box obtains all the target keywords extracted during the care process, determines the generation time of each target keyword, generates a care report according to each target keyword and its generation time, and then sends the generated care report to the served object.
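The care report generation can be sketched as a simple aggregation of target keywords and their generation times; the report format below is an assumption, since the patent only requires that the report be generated from the keywords and their generation times:

```python
def generate_care_report(keyword_records):
    """keyword_records: list of (generation_time, target_keyword) tuples."""
    lines = ["Care report"]
    for time, keyword in sorted(keyword_records):
        lines.append(f"{time}  service item confirmed: {keyword}")
    return "\n".join(lines)

records = [("09:05", "hair washing"), ("09:40", "massage")]
print(generate_care_report(records))
```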
In this embodiment, by generating the nursing report, the family of the served object can know the specific service items provided by the service staff in the process of home-based care service according to the nursing report.
It should be understood that although the steps in the flowcharts of fig. 2 and 4 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in sequence as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least some of the steps in fig. 2 and 4 may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of performing the steps or stages is not necessarily sequential, but may be performed alternately or alternately with other steps or at least some of the other steps or stages.
In one embodiment, as shown in fig. 5, there is provided a service content monitoring apparatus 500 including: an index network generation module 502, a target keyword extraction module 504, and a service content determination module 506, wherein:
an index network generation module 502, configured to obtain a service expression set and a speech to be recognized; the service expression set comprises a plurality of service expressions; and recognizing the speech to be recognized based on the pre-trained speech recognition model to obtain the index network.
A target keyword extraction module 504, configured to perform character matching on the index network and the service expression set, and determine a service expression matched with the speech to be recognized; and extracting a target keyword from the service expression matched with the voice to be recognized.
A service content determining module 506, configured to determine the service content of the service according to the target keyword.
In one embodiment, the index network generation module 502 further includes a model training module 5021 for obtaining a sample text corresponding to the sample speech and a pronunciation dictionary; the sample text comprises at least one word to be labeled; carrying out pronunciation labeling on the word to be labeled according to a pronunciation dictionary to obtain a label sequence; a speech recognition model is trained based on the sample speech and the corresponding tag sequence.
In one embodiment, the model training module 5021 is further configured to perform segmentation matching on the segmentation to be labeled and the pronunciation dictionary, and determine whether pronunciation segmentation matched with the segmentation to be labeled exists in the pronunciation dictionary based on the segmentation matching result; when the pronunciation dictionary has pronunciation participles matched with the participles to be labeled, labeling the participles to be labeled according to pronunciation labels corresponding to the matched pronunciation participles; when the pronunciation dictionary does not have pronunciation participles matched with the participles to be labeled, segmenting the participles to be labeled based on a preset rule to obtain participle segments; and taking the segmentation segment as a segmentation to be labeled, and returning to the step of performing segmentation matching on the segmentation to be labeled and the pronunciation dictionary until pronunciation segmentation matched with the segmentation to be labeled exists in the pronunciation dictionary.
In one embodiment, the model training module 5021 is further configured to obtain a first loss function of the speech separation enhancement model and a second loss function of the target recognition model; performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model to obtain a robust representation model; fusing the first loss function and the second loss function to obtain a target loss function; and performing joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and finishing the training when a preset convergence condition is met.
In one embodiment, the index network generation module 502 further includes a candidate character sequence determination module 5022, configured to extract a voice feature of the voice to be recognized, and determine the pinyin of each word in the voice to be recognized based on the voice feature; the pinyin is composed of one or more sound units; determining the mapping relation between the sound unit in the pinyin and the corresponding fuzzy sound; determining a candidate character sequence according to the pronunciation dictionary and the mapping relation; and generating an index network based on the candidate character sequence.
In one embodiment, the target keyword extraction module 504 further includes an offset distance determination module 5041, configured to determine a service expression that each candidate word sequence in the index network respectively matches by character matching the index network with the service expression set; calculating the offset distance of each candidate word sequence relative to the matched service expression; screening out a target character sequence from the candidate character sequences based on the offset distance; and judging the service expression matched with the target character sequence as the service expression matched with the voice to be recognized.
In one embodiment, the service content monitoring apparatus 500 is further configured to determine all of the extracted target keywords; determining the generation time of each target keyword; a care report is generated based on the generation time and the target keywords.
For specific limitations of the service content monitoring apparatus, reference may be made to the above limitations of the service content monitoring method, which will not be described herein again. The modules in the service content monitoring device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, an electronic device is provided, which may be a terminal, and whose internal structure may be as shown in fig. 6. The electronic device comprises a processor, a memory, a communication interface, a display screen and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the electronic device is used for wired or wireless communication with an external terminal; the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program, when executed by the processor, implements the service content monitoring method. The display screen of the electronic device can be a liquid crystal display screen or an electronic ink display screen, and the input device of the electronic device can be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the electronic device, or an external keyboard, touch pad or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of part of the structure related to the present disclosure and does not limit the computing devices to which the present disclosure applies; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, an electronic device is provided, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following steps:
acquiring a service expression set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
recognizing the speech to be recognized based on a pre-trained speech recognition model to obtain an index network;
performing character matching on the index network and the service expression set to determine service expressions matched with the speech to be recognized;
extracting a target keyword from the service expression matched with the voice to be recognized;
and determining the service content according to the target keyword.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring a sample text and a pronunciation dictionary corresponding to the sample voice; the sample text comprises at least one word segment to be labeled;
performing pronunciation labeling on the word segment to be labeled according to the pronunciation dictionary to obtain a tag sequence;
a speech recognition model is trained based on the sample speech and the corresponding tag sequence.
In one embodiment, the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation tags; the processor, when executing the computer program, further performs the steps of:
matching the word segment to be labeled against the pronunciation dictionary, and determining, based on the matching result, whether a pronunciation word segment matching the word segment to be labeled exists in the pronunciation dictionary;
when a matching pronunciation word segment exists in the pronunciation dictionary, labeling the word segment to be labeled with the pronunciation tag corresponding to the matching pronunciation word segment;
when no matching pronunciation word segment exists in the pronunciation dictionary, splitting the word segment to be labeled into segments according to a preset rule;
and taking each segment as a word segment to be labeled, and returning to the step of matching the word segment to be labeled against the pronunciation dictionary, until a matching pronunciation word segment exists in the pronunciation dictionary.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
acquiring a first loss function of a voice separation enhancement model and a second loss function of a target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model to obtain a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function;
and performing joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and finishing the training when a preset convergence condition is met.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
extracting voice features of the voice to be recognized, and determining the pinyin of each word segment in the voice to be recognized based on the voice features; the pinyin is composed of one or more sound units;
determining the mapping relation between the sound unit in the pinyin and the corresponding fuzzy sound;
determining a candidate character sequence according to the pronunciation dictionary and the mapping relation;
and generating an index network based on the candidate character sequence.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
performing character matching on the index network and the service expression set to determine service expressions matched with each candidate character sequence in the index network;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening out a target character sequence from the candidate character sequences based on the offset distances;
and taking the service expression matched by the target character sequence as the service expression matched with the voice to be recognized.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
determining all extracted target keywords;
determining the generation time of each target keyword;
a care report is generated based on the generation time and the target keywords.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a service expression set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
recognizing the speech to be recognized based on a pre-trained speech recognition model to obtain an index network;
performing character matching on the index network and the service expression set to determine service expressions matched with the speech to be recognized;
extracting a target keyword from the service expression matched with the voice to be recognized;
and determining the service content according to the target keyword.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a sample text and a pronunciation dictionary corresponding to the sample voice; the sample text comprises at least one word segment to be labeled;
performing pronunciation labeling on the word segment to be labeled according to the pronunciation dictionary to obtain a tag sequence;
a speech recognition model is trained based on the sample speech and the corresponding tag sequence.
In one embodiment, the pronunciation dictionary includes pronunciation word segments and corresponding pronunciation tags; the computer program, when executed by the processor, further performs the steps of:
matching the word segment to be labeled against the pronunciation dictionary, and determining, based on the matching result, whether a pronunciation word segment matching the word segment to be labeled exists in the pronunciation dictionary;
when a matching pronunciation word segment exists in the pronunciation dictionary, labeling the word segment to be labeled with the pronunciation tag corresponding to the matching pronunciation word segment;
when no matching pronunciation word segment exists in the pronunciation dictionary, splitting the word segment to be labeled into segments according to a preset rule;
and taking each segment as a word segment to be labeled, and returning to the step of matching the word segment to be labeled against the pronunciation dictionary, until a matching pronunciation word segment exists in the pronunciation dictionary.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring a first loss function of a voice separation enhancement model and a second loss function of a target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model to obtain a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function;
and performing joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and finishing the training when a preset convergence condition is met.
In one embodiment, the computer program when executed by the processor further performs the steps of:
extracting voice features of the voice to be recognized, and determining the pinyin of each word segment in the voice to be recognized based on the voice features; the pinyin is composed of one or more sound units;
determining the mapping relation between the sound unit in the pinyin and the corresponding fuzzy sound;
determining a candidate character sequence according to the pronunciation dictionary and the mapping relation;
and generating an index network based on the candidate character sequence.
In one embodiment, the computer program when executed by the processor further performs the steps of:
performing character matching on the index network and the service expression set to determine service expressions matched with each candidate character sequence in the index network;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening out a target character sequence from the candidate character sequences based on the offset distances;
and taking the service expression matched by the target character sequence as the service expression matched with the voice to be recognized.
In one embodiment, the computer program when executed by the processor further performs the steps of:
determining all extracted target keywords;
determining the generation time of each target keyword;
a care report is generated based on the generation time and the target keywords.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to be within the scope of this specification.
The above-mentioned embodiments only express several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for service content monitoring, the method comprising:
acquiring a service expression set and a voice to be recognized; the service expression set comprises a plurality of service expressions;
recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
determining the service expression matched with the speech to be recognized by performing character matching on the index network and the service expression set;
extracting a target keyword from the service expression matched with the voice to be recognized;
and determining service content according to the target keyword.
2. The method of claim 1, wherein the step of training the speech recognition model comprises:
acquiring a sample text and a pronunciation dictionary corresponding to the sample voice; the sample text comprises at least one word segment to be labeled;
performing pronunciation labeling on the word segment to be labeled according to the pronunciation dictionary to obtain a tag sequence;
training a speech recognition model based on the sample speech and the corresponding tag sequence.
3. The method of claim 2, wherein the pronunciation dictionary comprises pronunciation word segments and corresponding pronunciation tags; and the performing pronunciation labeling on the word segment to be labeled according to the pronunciation dictionary comprises:
matching the word segment to be labeled against the pronunciation dictionary, and determining, based on the matching result, whether a pronunciation word segment matching the word segment to be labeled exists in the pronunciation dictionary;
when a matching pronunciation word segment exists in the pronunciation dictionary, labeling the word segment to be labeled with the pronunciation tag corresponding to the matching pronunciation word segment;
when no matching pronunciation word segment exists in the pronunciation dictionary, splitting the word segment to be labeled into segments according to a preset rule;
and taking each segment as a word segment to be labeled, and returning to the step of matching the word segment to be labeled against the pronunciation dictionary, until a matching pronunciation word segment exists in the pronunciation dictionary.
4. The method of claim 1, wherein the speech recognition model includes a speech separation enhancement model and a target recognition model; and the training step of the speech recognition model comprises:
acquiring a first loss function of a voice separation enhancement model and a second loss function of a target recognition model;
performing back propagation based on the second loss function to train an intermediate model bridged between the speech separation enhancement model and the target recognition model to obtain a robust representation model;
fusing the first loss function and the second loss function to obtain a target loss function;
and performing joint training on the voice separation enhancement model, the robust representation model and the target recognition model based on the target loss function, and finishing the training when a preset convergence condition is met.
5. The method of claim 1, wherein the recognizing the speech to be recognized based on the pre-trained speech recognition model to obtain an index network comprises:
extracting voice features of the voice to be recognized, and determining the pinyin of each word segment in the voice to be recognized based on the voice features; the pinyin is composed of one or more sound units;
determining a mapping relation between a sound unit in the pinyin and a corresponding fuzzy sound;
determining a candidate character sequence according to a pronunciation dictionary and the mapping relation;
and generating an index network based on the candidate character sequence.
6. The method of claim 1, wherein determining the service expression matching the speech to be recognized by character matching the index network and the service expression set comprises:
determining, by performing character matching between the index network and the service expression set, the service expression matched by each candidate character sequence in the index network;
calculating the offset distance of each candidate character sequence relative to its matched service expression;
screening out a target character sequence from the candidate character sequences based on the offset distances;
and taking the service expression matched by the target character sequence as the service expression matched with the voice to be recognized.
7. The method of claim 1, further comprising:
determining all extracted target keywords;
determining the generation time of each target keyword;
generating a care report based on the generation time and the target keyword.
8. A service content monitoring apparatus, characterized in that the apparatus comprises:
the index network generation module is used for acquiring the service expression set and the voice to be recognized; the service expression set comprises a plurality of service expressions; recognizing the voice to be recognized based on a pre-trained voice recognition model to obtain an index network;
the target keyword extraction module is used for performing character matching on the index network and the service expression set to determine service expressions matched with the speech to be recognized; extracting a target keyword from the service expression matched with the voice to be recognized;
and the service content determining module is used for determining service content according to the target keyword.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011060127.4A 2020-09-30 2020-09-30 Service content monitoring method and device, electronic equipment and storage medium Pending CN112331207A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011060127.4A CN112331207A (en) 2020-09-30 2020-09-30 Service content monitoring method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011060127.4A CN112331207A (en) 2020-09-30 2020-09-30 Service content monitoring method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112331207A true CN112331207A (en) 2021-02-05

Family

ID=74313342

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011060127.4A Pending CN112331207A (en) 2020-09-30 2020-09-30 Service content monitoring method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112331207A (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787070A (en) * 2005-12-09 2006-06-14 北京凌声芯语音科技有限公司 Chip upper system for language learner
CN102867512A (en) * 2011-07-04 2013-01-09 余喆 Method and device for recognizing natural speech
CN103794211A (en) * 2012-11-02 2014-05-14 北京百度网讯科技有限公司 Voice recognition method and system
CN103458056A (en) * 2013-09-24 2013-12-18 贵阳世纪恒通科技有限公司 Speech intention judging method based on automatic classification technology for automatic outbound system
CN103578464A (en) * 2013-10-18 2014-02-12 威盛电子股份有限公司 Language model establishing method, speech recognition method and electronic device
CN103700369A (en) * 2013-11-26 2014-04-02 安徽科大讯飞信息科技股份有限公司 Voice navigation method and system
CN105869640A (en) * 2015-01-21 2016-08-17 上海墨百意信息科技有限公司 Method and device for recognizing voice control instruction for entity in current page
CN107430502A (en) * 2015-01-30 2017-12-01 谷歌技术控股有限责任公司 The voice command for software application is inferred by help information dynamic
CN107170444A (en) * 2017-06-15 2017-09-15 上海航空电器有限公司 Aviation cockpit environment self-adaption phonetic feature model training method
CN108288468A (en) * 2017-06-29 2018-07-17 腾讯科技(深圳)有限公司 Audio recognition method and device
CN108986790A (en) * 2018-09-29 2018-12-11 百度在线网络技术(北京)有限公司 The method and apparatus of voice recognition of contact
CN109961792A (en) * 2019-03-04 2019-07-02 百度在线网络技术(北京)有限公司 The method and apparatus of voice for identification
CN110310631A (en) * 2019-06-28 2019-10-08 北京百度网讯科技有限公司 Audio recognition method, device, server and storage medium
CN111261146A (en) * 2020-01-16 2020-06-09 腾讯科技(深圳)有限公司 Speech recognition and model training method, device and computer readable storage medium
CN111341305A (en) * 2020-03-05 2020-06-26 苏宁云计算有限公司 Audio data labeling method, device and system
CN111429912A (en) * 2020-03-17 2020-07-17 厦门快商通科技股份有限公司 Keyword detection method, system, mobile terminal and storage medium
CN111652775A (en) * 2020-05-07 2020-09-11 上海奥珩企业管理有限公司 Method for constructing household service process management system model

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530414A (en) * 2021-02-08 2021-03-19 数据堂(北京)科技股份有限公司 Iterative large-scale pronunciation dictionary construction method and device
CN112530414B (en) * 2021-02-08 2021-05-25 数据堂(北京)科技股份有限公司 Iterative large-scale pronunciation dictionary construction method and device
CN113380231A (en) * 2021-06-15 2021-09-10 北京一起教育科技有限责任公司 Voice conversion method and device and electronic equipment
CN113380231B (en) * 2021-06-15 2023-01-24 北京一起教育科技有限责任公司 Voice conversion method and device and electronic equipment
CN113593577A (en) * 2021-09-06 2021-11-02 四川易海天科技有限公司 Vehicle-mounted artificial intelligence voice interaction system based on big data

Similar Documents

Publication Publication Date Title
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
CN111739508B (en) End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
US10074363B2 (en) Method and apparatus for keyword speech recognition
WO2017076211A1 (en) Voice-based role separation method and device
CN112331207A (en) Service content monitoring method and device, electronic equipment and storage medium
CN111145782B (en) Overlapped speech recognition method, device, computer equipment and storage medium
CN111145733B (en) Speech recognition method, speech recognition device, computer equipment and computer readable storage medium
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
CN110992959A (en) Voice recognition method and system
EP4392972A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN114999463B (en) Voice recognition method, device, equipment and medium
JP2002215187A (en) Speech recognition method and device for the same
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN113823265A (en) Voice recognition method and device and computer equipment
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
CN113990288B (en) Method for automatically generating and deploying voice synthesis model by voice customer service
CN114255735A (en) Speech synthesis method and system
CN114566156A (en) Keyword speech recognition method and device
CN112309398B (en) Method and device for monitoring working time, electronic equipment and storage medium
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
CN113920987A (en) Voice recognition method, device, equipment and storage medium
CN112309374A (en) Service report generation method and device and computer equipment
CN112309398A (en) Working time monitoring method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination