CN112837683B - Voice service method and device

Voice service method and device

Info

Publication number
CN112837683B
Authority
CN
China
Prior art keywords
model
voice
customized
target
application
Prior art date
Legal status
Active
Application number
CN202011623769.0A
Other languages
Chinese (zh)
Other versions
CN112837683A (en)
Inventor
吴旺
Current Assignee
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Sipic Technology Co Ltd
Priority to CN202011623769.0A
Publication of CN112837683A
Application granted
Publication of CN112837683B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 2015/0638: Interactive procedures
    • G10L 2015/225: Feedback of the input speech

Abstract

The invention discloses a voice service method and apparatus. In the method, a target application identifier is obtained from a user voice request; whether a target customized voice model corresponding to the target application identifier exists in a model database is then detected, where the model database includes a plurality of customized voice models and their corresponding application identifiers. When the target customized voice model is detected to exist, a first voice processing result corresponding to the user voice data in the user voice request is determined using the target customized voice model, and a feedback service operation is performed according to the first voice processing result. In this way, the personalized voice service requirements of application operators can be met conveniently.

Description

Voice service method and device
Technical Field
The invention belongs to the technical field of voice processing, and particularly relates to a voice service method and a voice service device.
Background
With the continuous development of modern intelligent technology and the iterative updating of voice technology, people's requirements for voice services keep growing and diversifying, covering voice interaction services, voice recognition services, and the like.
However, because personalized service requirements are diverse and service data iterates quickly, a general-purpose speech model often cannot meet users' personalized needs well. For example, some services may need "xiaodu" to be recognized as the product name "小度" (Xiaodu) rather than the homophone "消毒" (disinfection), or "lixueqin" to be recognized as the personal name "李雪琴" (Lixueqin) rather than a homophonic alternative, and the like.
In view of the above problems, the industry has not provided a better solution for the moment.
Disclosure of Invention
An embodiment of the present invention provides a voice service method and apparatus, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a voice service method, including: acquiring a target application identifier in a user voice request; detecting whether a target customized voice model corresponding to the target application identifier exists in a model database, wherein the model database comprises a plurality of customized voice models and corresponding application identifiers; when the target customized voice model is detected to exist, determining a first voice processing result corresponding to user voice data in the user voice request by using the target customized voice model; and performing feedback service operation according to the first voice processing result.
In a second aspect, an embodiment of the present invention provides a voice service apparatus, including: the application identification acquisition unit is configured to acquire a target application identification in a user voice request; a customized voice model detection unit configured to detect whether a target customized voice model corresponding to the target application identifier exists in a model database, wherein the model database comprises a plurality of customized voice models and corresponding application identifiers; a customized voice model using unit configured to determine a first voice processing result corresponding to user voice data in the user voice request by using the target customized voice model when the target customized voice model is detected to exist; and the feedback service unit is configured to perform feedback service operation according to the first voice processing result.
In a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the above-described method.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the above method.
The embodiment of the invention has the beneficial effects that:
by querying the model database, the voice service platform can invoke the target customized voice model corresponding to the target application identifier in the user voice request, determine a voice processing result for the user voice data, and perform the feedback service using that result. An application operator can therefore provide customized voice feedback services to users by configuring, in the server's model database, a customized voice model corresponding to its application identifier; for example, the customized voice model can recognize "xiaodu" as the product name "小度" (Xiaodu), meeting the operator's personalized requirements.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating an example of a voice service method according to an embodiment of the present invention;
FIG. 2 depicts a flowchart of one example of determining a target customized speech model identification from an application model identification mapping table, according to an embodiment of the invention;
FIG. 3 illustrates a flow diagram of an example of a feedback service operation based on a first speech processing result according to an embodiment of the present invention;
FIG. 4 shows a flowchart of an example of updating a model database according to an embodiment of the invention;
FIG. 5 depicts a flowchart of one example of determining a first customized speech model from a model data source, according to an embodiment of the invention;
FIG. 6 illustrates a flow diagram of an example of customizing a voice service via a voice service platform, in accordance with an embodiment of the present invention;
FIG. 7 is a flowchart illustrating an example of a voice service method according to an embodiment of the present invention;
FIG. 8 is a block diagram showing an example of a voice service apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without inventive effort fall within the scope of protection of the present invention.
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
As used in this application, the terms "module," "system," and the like are intended to refer to a computer-related entity: hardware, a combination of hardware and software, or software in execution. For example, an element may be, but is not limited to, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution; an element may be localized on one computer and/or distributed between two or more computers, and may operate through various computer-readable media. Elements may also communicate via local and/or remote processes, for example in accordance with a signal having one or more data packets, such as data exchanged with another element in a local system, in a distributed system, and/or across a network such as the Internet.
Finally, it should be further noted that the terms "comprises" and "comprising," when used herein, mean that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it.
Fig. 1 is a flowchart illustrating an example of a voice service method according to an embodiment of the present invention. The method may be executed by various processors or controllers, for example a server or a control device in a voice service platform.
As shown in fig. 1, in step 110, the target application identification in the user voice request is obtained. Illustratively, a user may operate a voice application on a client to send a corresponding user voice request to a server to implement a functional operation such as voice recognition, voice interaction, or voice wakeup.
In step 120, it is detected whether a target customized speech model corresponding to the target application identification exists in the model database. Here, the model database includes a plurality of customized speech models and corresponding application identifications.
In some application scenarios, the voice service platform may authorize multiple application operators such that the application operators may upload respective customized voice models to the server and store in the model database.
If the detection in step 120 indicates that the target customized speech model is present, the flow may jump to step 130. If the detection indicates that the target customized speech model is not present, the flow may jump to step 150.
In step 130, a first speech processing result corresponding to the user speech data in the user speech request is determined using the target customized speech model.
Then, in step 140, a feedback service operation is performed according to the first speech processing result.
In step 150, a feedback service operation is provided using the generic speech model.
With this embodiment of the invention, the voice service platform can automatically invoke different customized voice models for different business applications, meeting the personalized voice service requirements of different application operators. In addition, if a business application has no corresponding customized voice model, the general voice model can serve that application's traffic, guaranteeing the reliability of the platform's voice services.
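By way of illustration, the dispatch logic of Fig. 1 might be sketched as follows; the class name, the in-memory dictionary standing in for the model database, and the function names are assumptions made for this sketch, not the platform's actual implementation.

```python
# Minimal sketch of the Fig. 1 flow. All names here (SpeechModel,
# MODEL_DATABASE, serve_voice_request) are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class SpeechModel:
    name: str

    def process(self, voice_data: bytes) -> str:
        # A real model would run recognition/understanding here.
        return f"[{self.name}] processed {len(voice_data)} bytes"


# Model database: application identifier -> customized speech model.
MODEL_DATABASE = {"app-001": SpeechModel("customized:app-001")}
GENERIC_MODEL = SpeechModel("generic")


def serve_voice_request(target_app_id: str, voice_data: bytes) -> str:
    """Steps 110-150: look up the customized model bound to the target
    application identifier; fall back to the generic model if none exists."""
    model = MODEL_DATABASE.get(target_app_id, GENERIC_MODEL)
    return model.process(voice_data)  # steps 130/140 (or generic path, 150)


print(serve_voice_request("app-001", b"\x00" * 16))  # customized path
print(serve_voice_request("app-xyz", b"\x00" * 16))  # generic fallback
```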
With respect to implementation details of step 120 described above, in some embodiments, it may be determined whether a target customized speech model identification corresponding to the target application identification exists according to an application model identification mapping table. Here, the application model identification mapping table includes a plurality of application identifications and corresponding customized speech model identifications, and the target customized speech model identification is used to locate a corresponding target customized speech model in the model database.
By the embodiment of the invention, the voice service platform can position the target customized voice model corresponding to the target application identifier in the model database by using the application model identifier mapping table, can effectively identify whether the customized voice model corresponding to the application identifier exists in the model database, and can quickly call the customized voice model corresponding to the application identifier.
In some application scenarios, several speech model modules with different functions need to cooperate to provide a voice service, for example an acoustic model module, a language model module, a natural language understanding model module, and a dialogue management model module.
Specifically, in the embodiment of the present invention, the customized voice model includes a plurality of voice model modules, each of the voice model modules is configured to have a corresponding model module identification, and the customized voice model identification includes one or more customized voice model module identifications.
FIG. 2 depicts a flowchart of one example of determining a target customized speech model identification from an application model identification mapping table, according to an embodiment of the invention.
As shown in FIG. 2, in step 210, at least one target customized speech model module identification corresponding to the target application identification is determined according to the application model identification mapping table.
In step 220, for each speech processing stage used to respond to the user's speech request, a speech model module type corresponding to that stage is obtained. Specifically, the speech service process may consist of several speech processing stages, each of which performs its processing with a speech model module of the corresponding type. Illustratively, the voice service type may be a voice interaction service, in which an ASR (Automatic Speech Recognition) stage, an NLU (Natural Language Understanding) stage, a DM (Dialogue Management) stage, and an NLG (Natural Language Generation) stage may be performed in sequence to implement the voice interaction with the user. An AM (Acoustic Model) and an LM (Language Model) can be used in the ASR stage, an NLU model in the NLU stage, a DM model in the DM stage, and an NLG model in the NLG stage.
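Illustratively, this stage-to-module-type correspondence can be captured in a small table; the constant names in the sketch below are assumptions.

```python
# Assumed constant names; only the pairing itself follows the description
# above (AM + LM for the ASR stage, then NLU, DM, and NLG models).
VOICE_INTERACTION_PIPELINE = [
    ("ASR", ["AM", "LM"]),  # acoustic model + language model
    ("NLU", ["NLU"]),       # natural language understanding model
    ("DM", ["DM"]),         # dialogue management model
    ("NLG", ["NLG"]),       # natural language generation model
]

for stage, module_types in VOICE_INTERACTION_PIPELINE:
    print(f"stage {stage} uses module types {module_types}")
```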
In step 230, it is detected whether there is a corresponding customized voice model module identifier for each voice model module type in at least one target customized voice model module identifier.
If the detection result in step 230 indicates that there is no corresponding customized speech model module identification for the first speech model module type, then go to step 240. If the detection result in step 230 indicates that there is a corresponding customized speech model module identification for each speech model module type, then go to step 250.
In step 240, the general speech model module corresponding to the first speech model module type is utilized to perform the speech processing operation of the corresponding speech processing stage.
In step 250, the customized voice model module corresponding to each voice model module type is used to perform the voice processing operation of the corresponding voice processing stage.
In a user voice interaction service scenario, when the NLU stage finishes and the user intention classification result is obtained, the key semantic slot value information accumulated over the dialogue turns can first be extracted; this information is then used to access a WEB API that queries the corresponding database, and the NLG output is generated from different templates according to the query result. In some embodiments, the WEB API may be provided by the application operator and used to access an enterprise knowledge base.
With this embodiment of the invention, a table lookup can be performed on the target application identifier to determine whether customized model modules exist for the different processing stages of the voice service; when every stage has a corresponding customized model module, those modules perform the processing of their respective stages. In addition, if some speech processing stage has no corresponding customized model module, the corresponding general model module can serve that stage, so an application service provider need not customize a functional model module for every stage, and the reliability of the speech service is still guaranteed.
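A hedged sketch of this per-stage selection follows; the layout of the application model identifier mapping table and the function names are assumptions made for the example.

```python
# Sketch of steps 210-250. The table layouts are illustrative: the mapping
# table stores, per application identifier, the customized module
# identifiers keyed by module type.
APP_MODEL_ID_MAP = {
    "app-001": {"LM": "lm-app-001", "NLU": "nlu-app-001"},
}

GENERIC_MODULES = {
    "AM": "am-generic", "LM": "lm-generic", "NLU": "nlu-generic",
    "DM": "dm-generic", "NLG": "nlg-generic",
}


def select_module(app_id: str, module_type: str) -> str:
    """Return the customized module identifier for this stage when one is
    registered (step 250), otherwise the generic module (step 240)."""
    return APP_MODEL_ID_MAP.get(app_id, {}).get(
        module_type, GENERIC_MODULES[module_type])


# This operator customized only LM and NLU; the AM, DM, and NLG stages
# are served by the generic modules.
for module_type in ("AM", "LM", "NLU", "DM", "NLG"):
    print(module_type, "->", select_module("app-001", module_type))
```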
Fig. 3 shows a flowchart of an example of a feedback service operation according to a first voice processing result according to an embodiment of the present invention.
In some application scenarios, the application operator builds a customized model to improve the accuracy of speech processing results; however, on certain specific audio data the customized model may be less accurate than the generic model, in which case the platform can fall back to the result produced by the generic model.
As shown in FIG. 3, in step 310, a second speech processing result corresponding to the user speech data is determined using the generic model.
In step 320, a target speech processing result is determined from the first speech processing result and the second speech processing result according to the confidence degrees corresponding to the first speech processing result and the second speech processing result. For example, assuming that the first speech processing result has a higher confidence, the first speech processing result may be taken as the target speech processing result.
In step 330, a feedback service operation is performed according to the target speech processing result.
In the embodiment of the invention, the service is provided by utilizing the corresponding voice processing result with higher confidence coefficient in the target customized voice model and the general voice model, so that the reliability of the recognition result of the voice model can be improved, and the occurrence probability of deviation service can be reduced.
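Illustratively, the confidence-based arbitration of Fig. 3 might look like the following sketch; the (text, confidence) result format is an assumption consistent with step 320.

```python
# Sketch of steps 310-330: run both models and keep the higher-confidence
# result. The (text, confidence) tuple format is an assumption.
SpeechResult = tuple[str, float]  # (processing result, confidence)


def pick_target_result(first: SpeechResult, second: SpeechResult) -> SpeechResult:
    """Step 320: choose between the customized-model result (first) and
    the generic-model result (second) by confidence."""
    return first if first[1] >= second[1] else second


custom_result = ("I want to watch Lixueqin's talk show", 0.92)  # customized model
generic_result = ("I want to watch luxiqin stalk show", 0.47)   # generic model
print(pick_target_result(custom_result, generic_result))  # customized wins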
In the related art, some skill customization platforms or open software development platforms let developers build the resources required for customization. Illustratively, in the DuerOS hotfix system, a user may upload custom files such as intent.dic, dict.dic, and command.dic. Specifically, each line of intent.dic contains a skill identifier, an intention, and an utterance, and the system performs rule matching in line order at runtime; the dict.dic file contains lexicon names and lexicon values for the utterances, and the lexicon names can be referenced in the utterances of the intent.dic file; the command.dic file contains the natural-language text expanded from the intent.dic and dict.dic files and is used for manual background review. In addition, uploaded hotfix resources take effect immediately once staff review passes.
However, developers cannot see which resources were previously approved, so resources are easily overwritten by mistake; moreover, this approach cannot solve the problem of inaccurate speech recognition, does not support customizing the NLG, and cannot be associated with the developer's enterprise knowledge base service.
In view of this, the voice service platform provided in the embodiment of the present invention may further have a voice service customizing function. FIG. 4 shows a flowchart of an example of updating a model database according to an embodiment of the invention.
As shown in FIG. 4, in step 410, a model data source and corresponding first application identification are obtained.
For example, the application operator may upload model data sources and corresponding application identifications corresponding to the desired customized voice scenarios to the voice platform server.
In some embodiments, the application operator may interact with the server through a development client. Specifically, the application operator can access the voice service platform server via a configured link. The voice service platform server then sends a platform authorization login notification to the development client. The development client can then display a model customization interface with a customized-file upload control, through which the model data source and the first application identifier are uploaded. The voice service platform server then receives the model data source and the first application identifier from the development client. In this way, the application operator can upload customized resource information through visual operations without API programming, and development users need no programming experience, which helps broaden the applicability of the voice customization service.
In step 420, a corresponding first customized speech model is determined based on the model data source.
For example, the speech platform server may train a speech model on the model data source to obtain the corresponding first customized speech model. It should be appreciated that the model data source may be used either to further train an existing speech model or to build a completely new speech model.
In step 430, a model database is updated based on the first customized speech model and the first application identification.
In some application scenarios, the voice service platform can receive model data sources from different application operators and determine the corresponding customized voice models to update the model database, which gives application operators greater business freedom and meets the personalized requirements of voice services.
FIG. 5 illustrates a flow diagram of one example of determining a first customized speech model from a model data source according to an embodiment of the invention.
As shown in FIG. 5, at least one second customized speech model corresponding to the first application identification is determined in a model database in step 510. It will be appreciated that in some cases, there may be multiple speech services provided by the application operator, so there may be a corresponding plurality of customized speech models, and in other cases, the second customized speech model may be equivalent to the customized speech model module used to service the corresponding speech processing stage as described above.
For example, the DuerOS hotfix system does not address ASR recognition accuracy in human-computer dialogue; it only solves the intention classification problem in the second half of the pipeline, after the ASR result is available. And after intention classification, the human-computer interaction process may still require content retrieval, NLG generation, and TTS playback.
In step 520, it is detected whether the training data set corresponding to each second customized speech model covers the model data source.
In some cases, to improve ASR recognition accuracy, a custom acoustic model (AM) and a custom language model (LM) can be used: customizing the AM may require uploading several hours of human voice audio files with corresponding annotation results to a cloud acoustic-model training system, and customizing the LM may require uploading MB-scale text to the cloud language-model training system. The AM and LM produced by the cloud training system must then be deployed to the ASR service runtime environment before they can influence the final ASR result.
If the detection result in step 520 indicates that the training data set overlays the model data source, then it jumps to step 530. If the detection result in step 520 indicates that the training data set does not cover the model data source, it jumps to step 540.
In step 530, construction of the corresponding first customized speech model is rejected. Illustratively, a notification message that customized content already exists in the resource may be fed back directly to the user.
In step 540, a corresponding first customized speech model is constructed from the model data sources.
In this embodiment of the invention, when an application operator uploads a model data source, the voice platform server can identify whether the operator's existing customized voice models can already handle that data source; when they can, no retraining is needed, which saves platform system resources.
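One way to realize the coverage check of Fig. 5 is to compare content digests of the uploaded data source against the digests of each existing model's training set. The digest-based comparison in the sketch below is an assumption; the description above only requires detecting whether a training set "covers" the data source.

```python
# Sketch of steps 510-540. Using SHA-256 digests of normalized entries to
# test coverage is an assumption; only the covers-or-not decision comes
# from the description above.
import hashlib


def digests(entries: list[str]) -> set[str]:
    return {hashlib.sha256(e.strip().encode("utf-8")).hexdigest()
            for e in entries}


def should_train(model_data_source: list[str],
                 existing_training_sets: list[list[str]]) -> bool:
    """False (reject, step 530) when any existing second customized model's
    training set already covers the data source; True (train, step 540)."""
    new = digests(model_data_source)
    return not any(new <= digests(ts) for ts in existing_training_sets)


existing = [["xiaodu -> 小度", "lixueqin -> 李雪琴"]]
print(should_train(["xiaodu -> 小度"], existing))  # False: already covered
print(should_train(["new hotword"], existing))     # True: training needed
```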
FIG. 6 illustrates a flow diagram of an example of customizing a voice service through a voice service platform, according to an embodiment of the invention. At this time, the voice service platform can play a role of an 'intervention system', namely, a user can intervene in a voice service result by uploading customized information to the platform.
As shown in fig. 6, in step 610, the development client interacts with the voice service platform in a visual manner to upload corresponding customized content.
Illustratively, the application operator's developers can visually view and manage the uploaded resources through the intervention system's pages. Furthermore, a developer's customized resources (or model data sources) can be divided into input resources and output resources: input resources can include custom utterances and custom lexicons, while output resources can include custom skills, custom intents, custom NLG, and enterprise knowledge base URLs. Managing intervention resources visually in this way improves the developer experience and avoids some human errors.
The front-end pages provided by the intervention system help developers and operators manage the utterances and lexicon contents of their products, and reduce misoperations such as uploading duplicate utterances.
In step 620, the voice service platform detects whether trained resources already exist in the uploaded customized content.
For example, the intervention system may check the input resources uploaded by the developer through the API; provided the interface specification is met, it computes a hash over the input content and queries the distributed storage middleware with that hash to determine whether trained resources already exist. If trained model resources (such as LM and NLU resources) exist, a resource-training-success message is returned to the developer directly; if not, the process jumps to step 630 described below.
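The hash-then-query step can be sketched as a cache lookup; the in-memory dictionary standing in for the distributed storage middleware, and the JSON canonicalization of the input resources, are assumptions.

```python
# Sketch of the check in step 620: hash the uploaded input resources and
# use the digest as a cache key. The dictionary below stands in for the
# distributed storage middleware; its layout is an assumption.
import hashlib
import json

TRAINED_RESOURCE_CACHE: dict[str, dict] = {}


def content_key(input_resources: dict) -> str:
    canonical = json.dumps(input_resources, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_trained_resources(input_resources: dict) -> dict | None:
    """Return previously trained LM/NLU resources for identical content,
    or None when training is still required (jump to step 630)."""
    return TRAINED_RESOURCE_CACHE.get(content_key(input_resources))


resources = {"utterances": ["custom utterance"], "lexicon": ["Lixueqin"]}
print(find_trained_resources(resources))  # None: not trained yet
```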
In step 630, the speech service platform trains the corresponding LM, NLU, and DM models from the customized content.
Illustratively, the intervention system passes the customized content to both the LM training service module and the NLU training service module. If both training service modules return a training-success message, the deployment steps below are carried out; otherwise, a resource-training-failure message is returned to the developer.
In step 640, the voice service platform can deploy the customized content-based LM, NLU, and DM resources into an ASR service configuration, an NLU service configuration, and a DM service configuration, respectively.
Illustratively, the intervention system may deploy LM, NLU and DM resources to ASR, NLU and DM service runtime environments, respectively.
In step 650, the speech services platform can bind the training model identification and the application product identification.
For example, the intervention system may bind the training model identification and the developer product identification to facilitate identification and location of the corresponding custom model through an identification query operation.
In combination with an application scenario: an application development user can open a browser, enter the product identifier and the original resource content on a front-end page, and upload them to the intervention system through that page, including custom utterances, custom lexicons, custom skills, custom intents, custom NLG, and custom enterprise knowledge base URLs. After the intervention system determines that the request is legal and the content hash has not been trained, it can request the resource training service to train and produce ASR, NLU, and DM resources, publish the trained ASR, NLU, and DM models to the distributed data storage middleware used by each service at runtime, and bind the corresponding model identifier to the application product identifier.
In some embodiments, the logic of the intervention system or voice service platform may be folded into the resource training service module that trains the customized content, which removes one service module from the cluster and one project to maintain. At the same time, however, the content hash would then need to serve as the cache key of the customized model, coupling the platform's operational logic with the model training logic and potentially introducing instability into the existing online resource training service.
Fig. 7 is a flowchart illustrating an example of a voice service method according to an embodiment of the present invention.
As shown in FIG. 7, in step 710, the user uploads the voice stream data and the product identification to the voice service platform via the client for ASR service operations.
Illustratively, voice stream data and a product identifier may be uploaded to the recognition service via the system access protocol, and the service checks whether a bound custom LM resource exists. If the custom LM resource exists, the ASR decoder loads it together with the built-in AM resource and, after computation, outputs the a-way ASR result and its confidence; at the same time it loads the built-in (general) LM and AM resources and outputs the b-way ASR result and its confidence. When both the a-way and b-way results and confidences are available, the result with the higher confidence is selected as the final ASR result. Preferably, whether the final ASR result came from the a-way or the b-way can also be recorded synchronously in the context information.
According to this embodiment of the invention, the recognition service decoder can output a recognition path using the built-in AM and the customized LM, and select between the built-in and the customized recognition results by confidence. In some examples, as long as a customized LM exists, the final recognition result usually comes from the customized path, so the accuracy of recognition results can be optimized quickly through simple utterance and lexicon uploads.
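A sketch of this a-way/b-way decoding follows; the decoder interface is an illustrative assumption, and parallel decoding is simplified to two sequential calls.

```python
# Sketch of the dual-path decode in step 710. The decoder interface is an
# illustrative assumption; a real decoder consumes audio frames.
from typing import Callable

Decoder = Callable[[bytes], tuple[str, float]]  # returns (text, confidence)


def recognize(voice_stream: bytes,
              custom_decoder: Decoder | None,
              builtin_decoder: Decoder) -> tuple[str, float, str]:
    """Decode with the customized LM (a-way) when one is bound and with the
    built-in resources (b-way); keep the higher-confidence result and
    record its source for the dialogue context."""
    b_text, b_conf = builtin_decoder(voice_stream)
    if custom_decoder is None:
        return b_text, b_conf, "b-way"
    a_text, a_conf = custom_decoder(voice_stream)
    if a_conf >= b_conf:
        return a_text, a_conf, "a-way"
    return b_text, b_conf, "b-way"


# Toy decoders for demonstration only.
print(recognize(b"audio", lambda s: ("custom text", 0.9),
                lambda s: ("builtin text", 0.6)))  # a-way wins
```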
In step 720, the central control service module requests NLU service according to the ASR result to obtain a corresponding semantic parsing result.
Illustratively, the central control service module may first query the dialogue system context. If the current recognition result comes from the product's a-way, the central control service module can ask the intervention system for the NLU model identifier customized for the product, then carry this identifier together with the ASR result in the API request to the NLU service module; if the current recognition result comes from the product's b-way, the central control service module can request the NLU service module directly. The NLU service module decides whether to perform semantic parsing with the customized NLU model or the built-in (general) model according to whether the request message carries an extra model identifier, and returns the semantic parsing result to the central control service. In some implementations, the semantic parsing result may include skill information, intent information, and semantic slot value information.
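A minimal sketch of that selection is given below; the model classes and the request/response shapes are assumptions.

```python
# Sketch of step 720. The NLU service uses the customized model when the
# request carries an extra model identifier, otherwise the built-in
# (general) model. All names and result shapes are assumptions.
class NluModel:
    def __init__(self, name: str):
        self.name = name

    def parse(self, text: str) -> dict:
        # A real model would classify the skill/intent and fill slot values.
        return {"skill": "demo", "intent": "demo_intent",
                "slots": {"text": text}, "model": self.name}


CUSTOM_NLU_MODELS = {"nlu-app-001": NluModel("custom:nlu-app-001")}
BUILTIN_NLU = NluModel("builtin")


def nlu_service(asr_text: str, model_id: str | None = None) -> dict:
    """Parse with the customized NLU model iff the request names one."""
    model = CUSTOM_NLU_MODELS.get(model_id, BUILTIN_NLU) if model_id else BUILTIN_NLU
    return model.parse(asr_text)


print(nlu_service("I want to go to pizza", model_id="nlu-app-001"))
```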
With this embodiment of the invention, the NLU service module gives priority to matches against the custom utterances during semantic parsing. For example, built-in semantic resources may parse "I want to go to pizza" to the navigation skill, whereas a customized NLU model can conveniently parse the same utterance to the music skill.
In step 730, the central control service module may request the DM service module to determine the corresponding reply content information, and call the NLG service module to determine the corresponding dialogue reply audio.
For example, after obtaining the semantic parsing result, the central control service module may determine, from the dialogue system context, whether the product has customized DM resources. If so, it carries the customized model identifier when requesting the DM service module; otherwise it does not. The DM service module decides whether to use the custom DM model or the built-in model according to whether the request body carries an extra model identifier. If the dialogue flow involves a content query, the DM service may use the URL in the customized DM model (for example, interacting with the developer's self-built server via the DSK (Developer Skill Kit) protocol to access the enterprise database), and the content indicated by the DM result obtains the corresponding dialogue reply audio through the NLG service module. Thus, during dialogue computation the DM service module decides, according to its configuration, whether to access the enterprise knowledge base URL in the customized DM model, so customized answers can be realized according to product requirements.
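As a sketch, the optional enterprise knowledge base query might look like this; the use of urllib, the JSON payload, and the "answer" field are assumptions made for illustration, not the DSK protocol itself.

```python
# Sketch of the optional knowledge base query in step 730. The payload
# shape and the "answer" field are assumptions; a real DM service would
# follow the DSK protocol with the developer's self-built server.
import json
import urllib.request


def dm_reply(semantics: dict, kb_url: str | None = None) -> str:
    """Decide the reply content; when the customized DM model configures an
    enterprise knowledge base URL, query it for the content result."""
    if kb_url is None:
        return "built-in reply for intent " + semantics.get("intent", "unknown")
    request = urllib.request.Request(
        kb_url,
        data=json.dumps(semantics, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=3) as response:
        content = json.load(response).get("answer", "")
    return content or "no result found"
```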
In combination with the application scenario: a user uploads a product identifier and voice stream data to the access service according to the voice data access protocol, and the access service module requests the ASR service module, carrying the application product identifier in the request message. The ASR service module can query the intervention system with the product identifier to learn whether the application product has a customized LM model, load the customized LM model in the ASR service decoder for parallel decoding, output the a-way and b-way recognition results, and finally select the final recognition result by comparing confidences.
Then, the ASR service module may return the speech recognition result and its source information to the access service module, which passes them on to the downstream central control service module. If the recognition result comes from the a-way, the central control service module can ask the intervention system for the NLU model identifier corresponding to the product, and request the downstream NLU service module with the recognition result and the NLU model identifier (if one exists).
Then, the NLU service module can identify whether the customized NLU model identification exists, if so, the customized NLU model is loaded by the distributed storage middleware during semantic calculation, and a semantic analysis result is returned to the central control service module.
The central control service module may request a downstream DM service module using the semantic parsing result and the DM model identity, so that the DM service module identifies whether a custom DM model identity exists in the request. If so, the custom DM model is loaded by the distributed storage middleware at the time of dialogue computation.
In addition, if the customized DM model uses an enterprise knowledge base self-built by the developer, the DM service module can also query the enterprise knowledge base for a content query result to help decide the final NLG, which is returned along the original path and finally fed back to the user.
With this embodiment of the invention, recognition results in the voice dialogue process (such as the hit skill, the NLG, and other key information) can be quickly corrected to the expected results. For example, when intervening on the speech recognition result: the new hotword "李雪琴" (Lixueqin) is not included in the ASR service's built-in resources, so when the user says "I want to watch Lixueqin's talk show," the ASR service module recognizes a homophonic string instead and the user's intention cannot be hit. Correspondingly, with this embodiment, product operators can add the name Lixueqin to the corresponding lexicon through the interface, quickly resolving the mismatch between the speech recognition result and expectation. Thus, through the intervention system, the model can be iterated and optimized rapidly to meet personalized service scenarios, without retraining or reconfiguring the built-in LM resources and with little time cost.
Described in connection with another business scenario example: when intervening on semantic parsing results, a product may want the reply to a question such as "Who developed you?" to be decided by the enterprise itself, in keeping with the product's image. For example, the corresponding reply can be obtained from the enterprise knowledge base: "We are from a young Internet company; the founders all graduated from Cambridge University and focus on the artificial intelligence track." Thus, for the same question, different application products can realize personalized replies by means of the voice service platform's intervention system.
Fig. 8 is a block diagram showing an example of a voice service apparatus according to an embodiment of the present invention.
As shown in fig. 8, the speech service apparatus 800 includes an application identification obtaining unit 810, a customized speech model detecting unit 820, a customized speech model using unit 830, and a feedback service unit 840.
The application identification obtaining unit 810 is configured to obtain a target application identification in a user voice request.
The customized speech model detection unit 820 is configured to detect whether a target customized speech model corresponding to the target application identification exists in a model database, wherein the model database comprises a plurality of customized speech models and corresponding application identifications.
The customized speech model using unit 830 is configured to determine a first speech processing result corresponding to user speech data in the user speech request using the target customized speech model when the presence of the target customized speech model is detected.
The feedback service unit 840 is configured to perform a feedback service operation according to the first voice processing result.
The apparatus according to the embodiment of the present invention may be configured to execute the corresponding method embodiment of the present invention, and accordingly achieve the technical effect achieved by the method embodiment of the present invention, which is not described herein again.
In the embodiment of the present invention, the relevant functional module may be implemented by a hardware processor (hardware processor).
In another aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the program is executed by a processor to perform the steps of the above voice service method.
The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.
Electronic devices of embodiments of the present invention exist in a variety of forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication functions and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has the functions of calculation and processing, and generally has the mobile internet access characteristic. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, and smart toys and portable car navigation devices.
(4) And other electronic devices with data interaction functions.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a general hardware platform, or by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes over the related art, may be embodied as a software product stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, including instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments or parts of them.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A voice service method, comprising:
acquiring a target application identifier in a user voice request;
detecting whether a target customized voice model corresponding to the target application identifier exists in a model database, wherein the model database comprises a plurality of customized voice models and corresponding application identifiers;
when the target customized voice model is detected to exist, determining a first voice processing result corresponding to user voice data in the user voice request by using the target customized voice model; and
performing a feedback service operation according to the first voice processing result,
wherein detecting whether a target customized voice model corresponding to the target application identifier exists in the model database comprises:
determining whether a target customized voice model identification corresponding to the target application identification exists according to an application model identification mapping table, wherein the application model identification mapping table comprises a plurality of application identifications and corresponding customized voice model identifications, and the target customized voice model identification is used for positioning the corresponding target customized voice model in the model database,
wherein the customized speech model comprises a plurality of speech model modules, each of the speech model modules being configured to have a corresponding model module identification, and the customized speech model identification comprises one or more customized speech model module identifications,
wherein the determining whether a target customized voice model identifier corresponding to the target application identifier exists according to the application model identifier mapping table includes:
determining at least one target customized voice model module identifier corresponding to the target application identifier according to an application model identifier mapping table;
aiming at each voice processing stage for responding to a voice request of a user, acquiring a voice model module type corresponding to the voice processing stage;
detecting whether each voice model module type has a corresponding customized voice model module identifier in the at least one target customized voice model module identifier;
and if the corresponding customized voice model module identification does not exist aiming at the first voice model module type, performing the voice processing operation of the corresponding voice processing stage by utilizing the universal voice model module corresponding to the first voice model module type.
2. The method of claim 1, wherein the performing a feedback service operation according to the first speech processing result comprises:
determining a second voice processing result corresponding to the user voice data by using a universal model;
determining a target voice processing result from the first voice processing result and the second voice processing result according to the corresponding confidence degrees of the first voice processing result and the second voice processing result;
and performing feedback service operation according to the target voice processing result.
3. The method of claim 1, wherein the method further comprises:
obtaining a model data source and a corresponding first application identifier;
determining a corresponding first customized voice model according to the model data source;
updating the model database based on the first customized speech model and the first application identification.
4. The method of claim 3, wherein said determining a respective first customized speech model from the model data sources comprises:
determining at least one second customized speech model in the model database corresponding to the first application identification;
detecting whether a training data set corresponding to each second customized voice model covers the model data source;
if the model data source is covered, refusing to construct a corresponding first customized voice model;
and if the model data source is not covered, constructing a corresponding first customized voice model according to the model data source.
5. The method of claim 3, wherein said obtaining a model data source and a corresponding first application identification comprises:
sending a platform authorization login notification to a development client so that the development client displays a model customization interface with a customized file uploading control, wherein the customized file uploading control is used for uploading a model data source and a first application identifier;
receiving the model data source and the first application identification from the development client.
6. A voice service apparatus comprising:
the application identification acquisition unit is configured to acquire a target application identification in a user voice request;
a customized voice model detection unit configured to detect whether a target customized voice model corresponding to the target application identifier exists in a model database, wherein the model database comprises a plurality of customized voice models and corresponding application identifiers;
a customized voice model using unit configured to determine a first voice processing result corresponding to user voice data in the user voice request by using the target customized voice model when the target customized voice model is detected to exist; and
a feedback service unit configured to perform a feedback service operation according to the first voice processing result,
wherein the customized speech model detection unit is further configured to:
determining whether a target customized voice model identification corresponding to the target application identification exists according to an application model identification mapping table, wherein the application model identification mapping table comprises a plurality of application identifications and corresponding customized voice model identifications, and the target customized voice model identification is used for positioning the corresponding target customized voice model in the model database,
wherein the customized speech model comprises a plurality of speech model modules, each of the speech model modules being configured to have a corresponding model module identification, and the customized speech model identification comprises one or more customized speech model module identifications,
wherein the determining whether a target customized voice model identifier corresponding to the target application identifier exists according to the application model identifier mapping table includes:
determining at least one target customized voice model module identifier corresponding to the target application identifier according to an application model identifier mapping table;
aiming at each voice processing stage for responding to a voice request of a user, acquiring a voice model module type corresponding to the voice processing stage;
detecting whether each voice model module type has a corresponding customized voice model module identifier in the at least one target customized voice model module identifier;
and if the corresponding customized voice model module identification does not exist aiming at the first voice model module type, performing the voice processing operation of the corresponding voice processing stage by utilizing the universal voice model module corresponding to the first voice model module type.
7. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-5.
8. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 5.
CN202011623769.0A 2020-12-31 2020-12-31 Voice service method and device Active CN112837683B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202011623769.0A (CN112837683B) | 2020-12-31 | 2020-12-31 | Voice service method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202011623769.0A (CN112837683B) | 2020-12-31 | 2020-12-31 | Voice service method and device

Publications (2)

Publication Number | Publication Date
CN112837683A | 2021-05-25
CN112837683B | 2022-07-26

Family

Family ID: 75924318

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202011623769.0A (CN112837683B, Active) | Voice service method and device | 2020-12-31 | 2020-12-31

Country Status (1)

Country Link
CN (1) CN112837683B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113593558A (en) * 2021-07-28 2021-11-02 深圳创维-Rgb电子有限公司 Far-field voice adaptation method, device, equipment and storage medium
CN113793612B (en) * 2021-09-15 2024-04-09 京东科技信息技术有限公司 Updating method and device of model service and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10832664B2 (en) * 2016-08-19 2020-11-10 Google Llc Automated speech recognition using language models that selectively use domain-specific model components
KR20180074210A (en) * 2016-12-23 2018-07-03 삼성전자주식회사 Electronic device and voice recognition method of the electronic device
CN111739519A (en) * 2020-06-16 2020-10-02 平安科技(深圳)有限公司 Dialogue management processing method, device, equipment and medium based on voice recognition

Also Published As

Publication number Publication date
CN112837683A (en) 2021-05-25

Similar Documents

Publication Publication Date Title
CN111860753B (en) Directed acyclic graph-based framework for training models
KR102204740B1 (en) Method and system for processing unclear intention query in conversation system
CN109804428B (en) Synthesized voice selection for computing agents
CN111033492B (en) Providing command bundle suggestions for automated assistants
US10482884B1 (en) Outcome-oriented dialogs on a speech recognition platform
US11763092B2 (en) Techniques for out-of-domain (OOD) detection
WO2021232957A1 (en) Response method in man-machine dialogue, dialogue system, and storage medium
JP2020161153A (en) Parameter collection and automatic dialog generation in dialog systems
US11749276B2 (en) Voice assistant-enabled web application or web page
CN114424185A (en) Stop word data augmentation for natural language processing
CN115917553A (en) Entity-level data augmentation to enable robust named entity recognition in chat robots
KR102170088B1 (en) Method and system for auto response based on artificial intelligence
CN112837683B (en) Voice service method and device
CN115398436A (en) Noise data augmentation for natural language processing
US20220020358A1 (en) Electronic device for processing user utterance and operation method therefor
CN116583837A (en) Distance-based LOGIT values for natural language processing
CN116235164A (en) Out-of-range automatic transition for chat robots
EP4252149A1 (en) Method and system for over-prediction in neural networks
CN115148212A (en) Voice interaction method, intelligent device and system
CN116547676A (en) Enhanced logic for natural language processing
CN111399629B (en) Operation guiding method of terminal equipment, terminal equipment and storage medium
CN116724306A (en) Multi-feature balancing for natural language processors
CN111797636B (en) Offline semantic analysis method and system
US11893996B1 (en) Supplemental content output
US11908463B1 (en) Multi-session context

Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

SE01 Entry into force of request for substantive examination
GR01 Patent grant