CN112289314A - Voice processing method and device - Google Patents

Voice processing method and device

Info

Publication number
CN112289314A
CN112289314A (application CN202011039618.0A)
Authority
CN
China
Prior art keywords
voice
processing
service
voice processing
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011039618.0A
Other languages
Chinese (zh)
Inventor
邓练兵
高妍
陈小满
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Dahengqin Technology Development Co Ltd
Original Assignee
Zhuhai Dahengqin Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Dahengqin Technology Development Co Ltd filed Critical Zhuhai Dahengqin Technology Development Co Ltd
Priority to CN202011039618.0A priority Critical patent/CN112289314A/en
Publication of CN112289314A publication Critical patent/CN112289314A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Abstract

Embodiments of the invention provide a voice processing method and a voice processing apparatus, applied to a voice processing platform. The voice processing platform is deployed in a developer portal system, and the developer portal system is integrated into a city portal system. The method comprises the following steps: receiving voice processing service requests submitted by users of the plurality of service platforms; acquiring the to-be-processed voice data specified by the user; responding to the voice processing service request, and invoking the voice processing service interface to perform voice processing on the to-be-processed voice data to obtain a processing result; and sending the processing result to the user. In the city portal system, the voice processing platform deployed in the developer portal system receives the voice processing service requests sent by the other service platforms and invokes the voice processing service interface to process them, so that users of the other service platforms of the city portal system can obtain the results of voice processing.

Description

Voice processing method and device
Technical Field
The present invention relates to the field of network technologies, and in particular, to a voice processing method and a voice processing apparatus.
Background
With the development of computer, network, and communication technologies, taking the deep application of artificial intelligence as the path to urban construction has become the development direction of smart cities. This approach promotes the fusion of technology, business, data, and applications; realizes collaborative management and services across levels, regions, systems, departments, and businesses; and accelerates urban construction in cooperation with various industries, enterprises, and teams.
At present, most urban development lacks a multi-dimensional fusion plan. The information platforms of various industries, enterprises, and teams are functionally unrelated, do not share or exchange information, and their information, service flows, and applications are disconnected from one another, so the information in these platforms cannot be fully utilized.
Building a city portal system that realizes multi-dimensional fusion for a city, and making that system easier for users to use, are solutions urgently needed at present.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a speech processing method and a corresponding speech processing apparatus that overcome or at least partially solve the above problems.
To solve the above problems, an embodiment of the present invention discloses a voice processing method applied to a voice processing platform. The voice processing platform is deployed in a developer portal system, and the developer portal system is integrated into a city portal system. A voice processing service interface provided by the voice processing platform is registered in advance with a unified service gateway of the city portal system and published. The city portal system further includes a plurality of service platforms. The method includes:
receiving voice processing service requests submitted by users of the plurality of service platforms;
acquiring the to-be-processed voice data specified by the user;
responding to the voice processing service request, and invoking the voice processing service interface to perform voice processing on the to-be-processed voice data to obtain a processing result;
and sending the processing result to the user.
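The four steps above can be sketched as a minimal request handler. This is an illustrative sketch only; the class and field names (`SpeechPlatform`, `service`, `audio`) are assumptions, not names from the patent.

```python
# Hypothetical sketch of the claimed flow: receive a request, take the
# user-specified audio, invoke the matching processing interface, and
# return the result to the requesting user.

class SpeechPlatform:
    def __init__(self, interfaces):
        # interfaces: mapping from service type to a callable that processes data
        self.interfaces = interfaces

    def handle(self, request):
        service = request["service"]               # step 1: request received
        audio = request["audio"]                   # step 2: user-specified data
        result = self.interfaces[service](audio)   # step 3: invoke the interface
        return {"user": request["user"], "result": result}  # step 4: send back
```

A platform instance would be constructed with one callable per registered service interface, mirroring the interface registration in the unified service gateway.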
Optionally, the multiple service platforms include a front-end system, and the acquiring the to-be-processed voice data specified by the user includes:
establishing real-time streaming media communication connection with the front-end system;
and acquiring voice data to be processed from a front-end system of the user through the real-time streaming media communication connection.
Optionally, before the responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result, the method further includes:
carrying out endpoint detection on the voice data to be processed to obtain effective voice data;
the responding to the voice processing service request and invoking the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result includes:
and responding to the voice processing service request, and calling the voice processing service interface to perform voice processing on the effective voice data to obtain a processing result.
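Endpoint detection (voice activity detection) as described above can be illustrated with a toy energy-threshold trimmer. Real systems use far more robust detectors; the frame representation (per-frame energy values) and threshold here are assumptions for illustration.

```python
# Minimal energy-based endpoint detection: trim leading and trailing frames
# whose energy falls below a threshold, keeping the "effective" voice span.

def detect_endpoints(frames, threshold=0.1):
    """Return the slice of frames between the first and last voiced frame."""
    voiced = [i for i, energy in enumerate(frames) if energy >= threshold]
    if not voiced:
        return []  # no speech detected in the input
    return frames[voiced[0]:voiced[-1] + 1]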
Optionally, before the sending the processing result to the user, the method further includes:
optimizing the processing result, where the optimization processing includes: spoken-language smoothing, punctuation insertion, and inverse text normalization (ITN) processing;
the sending the processing result to the user includes:
and sending the processing result after the optimization processing to the user.
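Of the optimization steps named above, inverse text normalization and punctuation insertion can be illustrated with a toy rewrite pass. Production ITN uses rule or WFST systems; the tiny word map below is purely an assumption for demonstration.

```python
import re

# Toy inverse text normalization (ITN): rewrite spoken-form numbers as digits,
# then append terminal punctuation if the sentence lacks it.

SPOKEN_NUMBERS = {"one": "1", "two": "2", "three": "3", "ten": "10"}

def itn(text):
    words = [SPOKEN_NUMBERS.get(w, w) for w in text.split()]
    sentence = " ".join(words)
    if not re.search(r"[.!?]$", sentence):
        sentence += "."  # punctuation-insertion step
    return sentence
```

Applying such a pass to a raw transcript before returning it to the user is one way the "optimization processing" the claim describes could be realized.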
Optionally, the responding to the voice processing service request and invoking the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result includes:
acquiring a preset voice processing model for the user;
and responding to the voice processing service request, calling the voice processing service interface to perform voice processing on the voice data to be processed according to the preset voice processing model, and obtaining a processing result.
Optionally, the method further comprises:
acquiring a word bank uploaded by the user;
and training a preset voice processing model by adopting the word bank uploaded by the user.
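The patent leaves the training step abstract. One lightweight stand-in for adapting a model with a user-uploaded word bank is hypothesis rescoring: prefer recognition candidates that contain the user's lexicon terms. The function below is an illustrative assumption, not the patent's training procedure.

```python
# Sketch of user word-bank biasing: among candidate transcripts, pick the
# one containing the most terms from the user's uploaded lexicon.

def rescore(candidates, lexicon):
    """Return the candidate hypothesis with the most lexicon hits."""
    def hits(text):
        return sum(1 for term in lexicon if term in text)
    return max(candidates, key=hits)
```

In a real system the word bank would instead feed contextual biasing or fine-tuning of the preset voice processing model.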
Optionally, the voice processing service interface comprises a recording file recognition service interface, and/or a real-time voice recognition service interface, and/or a short-time voice recognition service interface, and/or a voice synthesis service interface, the voice data to be processed comprises a recording file, and/or real-time voice data, and/or short-time voice data, and/or text data, and the voice processing service request comprises a recording file recognition service request, and/or a real-time voice recognition service request, and/or a short-time voice recognition service request, and/or a voice synthesis service request;
the responding to the voice processing service request and invoking the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result includes:
responding to the sound recording file identification service request, and calling the sound recording file identification service interface to identify the sound recording file to obtain an identification result;
and/or responding to the real-time voice recognition service request, and calling the real-time voice recognition service interface to perform real-time recognition processing on the real-time voice data to obtain a recognition result;
and/or responding to the short-time voice recognition service request, and calling the short-time voice recognition service interface to perform recognition processing on the short-time voice data to obtain a recognition result;
and/or responding to the voice synthesis service request, and calling the voice synthesis service interface to perform voice synthesis processing on the text data to obtain a synthesized voice result.
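The four optional interfaces (recording-file recognition, real-time recognition, short-time recognition, and synthesis) amount to a dispatch on request type. The handler names and return strings below are hypothetical placeholders.

```python
# Dispatch table keyed by voice processing service request type; each handler
# stands in for one of the four service interfaces named in the claim.

def recognize_file(data):    return "file transcript: " + data
def recognize_stream(data):  return "live transcript: " + data
def recognize_short(data):   return "short transcript: " + data
def synthesize(data):        return "audio for: " + data

DISPATCH = {
    "file_recognition": recognize_file,
    "realtime_recognition": recognize_stream,
    "short_recognition": recognize_short,
    "synthesis": synthesize,
}

def serve(request_type, payload):
    return DISPATCH[request_type](payload)
```

Routing by request type this way lets the platform expose all four services behind the single gateway registration the claim describes.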
An embodiment of the present invention also discloses a voice processing apparatus applied to a voice processing platform. The voice processing platform is deployed in a developer portal system, and the developer portal system is integrated into a city portal system. A voice processing service interface provided by the voice processing platform is registered in advance with a unified service gateway of the city portal system and published. The city portal system further includes a plurality of service platforms. The apparatus includes:
the service request receiving module is used for receiving voice processing service requests provided by users of the plurality of service platforms;
the voice data to be processed acquisition module is used for acquiring the voice data to be processed specified by the user;
the voice processing module is used for responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result;
and the processing result sending module is used for sending the processing result to the user.
Optionally, the multiple service platforms include a front-end system, and the to-be-processed voice data acquiring module includes:
the communication connection establishing submodule is used for establishing real-time streaming media communication connection with the front-end system;
and the voice data to be processed acquisition submodule is used for acquiring the voice data to be processed from the front-end system of the user through the real-time streaming media communication connection.
Optionally, before the responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result, the apparatus further includes:
the effective voice data acquisition module is used for carrying out endpoint detection on the voice data to be processed to obtain effective voice data;
the voice processing module comprises:
and the voice processing submodule is used for responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the effective voice data to obtain a processing result.
Optionally, before the sending the processing result to the user, the apparatus further includes:
the optimization processing module is used for optimizing the processing result, where the optimization processing includes: spoken-language smoothing, punctuation insertion, and inverse text normalization (ITN) processing;
the processing result sending module comprises:
and the processing result sending submodule is used for sending the processing result after the optimization processing to the user.
Optionally, the speech processing module includes:
the voice processing model obtaining sub-module is used for obtaining a preset voice processing model aiming at the user;
and the voice processing model processing submodule is used for responding to the voice processing service request, calling the voice processing service interface to perform voice processing on the voice data to be processed according to the preset voice processing model, and obtaining a processing result.
Optionally, the speech processing module further comprises:
the word stock acquisition sub-module is used for acquiring the word stock uploaded by the user;
and the voice processing model training submodule is used for training a preset voice processing model by adopting the word bank uploaded by the user.
Optionally, the voice processing service interface includes a recording file recognition service interface, the to-be-processed voice data includes a recording file, and the voice processing service request includes a recording file recognition service request; the voice processing module comprises:
and the first voice processing submodule is used for responding to the recording file identification service request, calling the recording file identification service interface to identify the recording file and obtaining an identification result.
Optionally, the voice processing service interface includes a real-time voice recognition service interface, the to-be-processed voice data includes real-time voice data, and the voice processing service request includes a real-time voice recognition service request; the voice processing module comprises:
and the second voice processing submodule is used for responding to the real-time voice recognition service request and calling the real-time voice recognition service interface to perform real-time recognition processing on the real-time voice data to obtain a recognition result.
Optionally, the voice processing service interface includes a short-time voice recognition service interface, the to-be-processed voice data includes short-time voice data, and the voice processing service request includes a short-time voice recognition service request; the voice processing module comprises:
and the third voice processing submodule is used for responding to the short-time voice recognition service request and calling the short-time voice recognition service interface to perform recognition processing on the short-time voice data to obtain a recognition result.
Optionally, the voice processing service interface includes a voice synthesis service interface, the to-be-processed voice data includes text data, and the voice processing service request includes a voice synthesis service request; the voice processing module comprises:
and the fourth voice processing submodule is used for responding to the voice synthesis service request and calling the voice synthesis service interface to carry out voice synthesis processing on the text data to obtain a synthesized voice result.
The embodiment of the invention also discloses an electronic device, which comprises: a processor, a memory and a computer program stored on the memory and capable of running on the processor, the computer program, when executed by the processor, implementing the steps of any of the speech processing methods.
The embodiment of the invention also discloses a computer readable storage medium, on which a computer program is stored; the computer program, when executed by a processor, implements the steps of any of the above voice processing methods.
The embodiment of the invention has the following advantages:
in the embodiment of the present invention, the voice processing platform deployed in the developer portal system may receive voice processing service requests provided by users of multiple service platforms in the urban portal system, acquire to-be-processed voice data specified by the users who provide the requests, perform voice processing on the acquired to-be-processed voice data by calling a corresponding voice processing service interface, and return a processing result obtained by the processing to the users who provide the requests. In the urban portal system, the voice processing service requests sent by other service platforms are received by the voice processing platform deployed in the developer portal system, and the voice processing service interface is called for processing, so that users of other service platforms of the urban portal system can obtain processing results obtained after voice processing.
Drawings
FIG. 1 is a block diagram of a city portal system of an embodiment of the present invention;
FIG. 2 is a flowchart illustrating steps of a first embodiment of a speech processing method according to the present invention;
FIG. 3 is a flowchart illustrating steps of a second embodiment of a speech processing method;
FIG. 4 is a schematic flow chart illustrating speech processing of speech data to be processed according to an embodiment of the present invention;
FIG. 5 is a flow chart illustrating a process of a sentence recognition service in an embodiment of the present invention;
fig. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The city portal system is a public platform portal that integrates comprehensive Internet information for city planning and provides comprehensive application services. Its service objects include government, individuals, enterprises, and developers, and it can provide comprehensive services such as government affairs services, personal services, and enterprise services.
Referring to fig. 1, a block diagram of a city portal system according to an embodiment of the present invention is shown, which may specifically include: a front-end system 10, a back-end system 11, an API open platform 12, a developer portal system 13, an operation center 14, and the like.
A front-end system 10 that implements a plurality of functions and provides a plurality of pages; the plurality of pages include a plurality of UI elements corresponding to the plurality of functions; the plurality of functions includes a function supported by a backend system to provide a service.
The front-end system is the client facing the user and serves as the tool through which the user accesses the contents of the city portal system, including information, applications, API services, and system functions. The front-end system uses a general front-end development framework, such as Vue or React, to realize unified single-page presentation of the multiple service systems.
The front-end system supports multi-dimensional users, including tourists, natural persons, corporate legal persons, enterprise employees, and government personnel. It supports multi-dimensional city services, including government affairs services, public services, and characteristic services. Users can access a variety of city services through the front-end system.
The front-end system can include an APP client, a Web client, and a Web management end. The Web client faces tourists, natural persons, enterprises, and government users, and serves the city portal system's official website, the API open platform, and the developer portal. The APP client faces tourists, natural persons, enterprises, and government users, and serves the mobile APP. The Web management end faces operators and system administrators, and serves the operation center and other back-end management systems.
And the back-end system 11 is configured to provide service support for the front-end system, respond to a service request of the front-end system, and execute a corresponding service operation.
The back-end system uses general service components and technical services as a bridge to access underlying data. The front-end system is decoupled from the back-end system, and the back-end system provides service support for the front-end system. The two are deployed separately, and the back-end system supports dynamic capacity expansion to maximize system performance.
And the API open platform 12 is used for providing management services aiming at the API, including API publishing services, purchasing services and using services.
The API open platform provides a unified, standard data and system development environment. It can be applied to various industries and systems and is an open, comprehensive service platform with unified solution-capability services, aiming to manage and control the full life cycle of unified capability opening, including service capability access, open management, and capability application. By publishing API services to the API open platform, developers make them available for other parties to apply for or purchase on the platform.
A developer portal system 13 for providing an environment for API, application, data development and deployment, and common procedural components.
The developer portal system can provide developers with an environment for application, algorithm, and data development and deployment, as well as generic AI components, technical components, and business components. It includes functions such as application development, algorithm development, application publishing, and service publishing. Developers can quickly develop and publish applications and APIs based on the environments, components, and templates provided by the developer portal system. The developer portal system may include three platforms: an algorithm development platform, an application development platform, and a data development platform.
And the operation center 14 is used for managing the content, the users, the applications and the API of the city portal system.
The operation center is a business middle platform that provides unified daily operation management of the city portal system for managers and operators. Through the operation center, managers can uniformly manage the content, users, applications, and APIs of the city portal system.
The embodiment of the invention provides an urban portal system which can integrate a front-end system, a back-end system, an API open platform, a developer portal system, an operation center and other platforms. The front-end system serves as a client and faces various users, and the users can obtain contents provided by various platforms integrated in the urban portal system by performing operations on the front-end system. The back-end system provides service support for the front-end system, responds to the service request of the front-end system and executes corresponding service operation. The API open platform provides management services aiming at the API, including API publishing services, purchasing services and using services; the method can be used for developers to call API uniformly and establish a standard and uniform information platform. The developer portal system can provide an environment for API, application, data development and deployment, and general purpose procedural components; and the development of various services can be realized by developers. The operation center can manage the content, users, applications and API of the city portal system. The embodiment of the invention provides a comprehensive city portal system for a city, which is oriented to various users in the city, and the users can quickly and conveniently realize various digital services through the city portal system; and standardized service development is realized through the city portal system.
Referring to fig. 2, a flowchart of a first embodiment of a speech processing method according to the present invention is shown, and is applied to a speech processing platform, where the speech processing platform is deployed in a developer portal system, the developer portal system is integrated in a city portal system, a speech processing service interface provided by the speech processing platform is registered in advance in a unified service gateway of the city portal system and is published, the city portal system further includes a plurality of service platforms, and specifically may include the following steps:
step 201, receiving voice processing service requests proposed by users of the plurality of service platforms;
in an embodiment of the present invention, the city portal system may include a developer portal system and other service platforms, the developer portal system and the other service platforms are in communication with each other, and the voice processing platform deployed in the developer portal system may receive voice processing service requests provided by users of the other service platforms, so as to respond to the service requests provided by the users and perform corresponding processing.
When users of the other service platforms submit voice processing service requests to the voice processing platform deployed in the developer portal system, these are in essence service communication requests between those service platforms and the developer portal system within the city portal system. When the city portal system detects such a service communication request, it can perform identity authentication on the requesting user; that is, only when the identity authentication platform determines that a session corresponding to the user exists and that the session is valid is the service request submitted by the user forwarded to the voice processing platform deployed in the developer portal system.
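The gateway-side check described above (forward a request only when the user's session exists and is still valid) can be sketched as follows. The session store and field names are assumptions for illustration.

```python
import time

# Hedged sketch of session-based identity authentication before forwarding a
# service request to the voice processing platform.

SESSIONS = {}  # user_id -> session expiry timestamp (seconds)

def authenticate_and_forward(user_id, request, forward, now=None):
    """Forward the request only if the user's session exists and is unexpired."""
    now = time.time() if now is None else now
    expiry = SESSIONS.get(user_id)
    if expiry is None or expiry <= now:
        return {"error": "invalid or expired session"}
    return forward(request)
```

Centralizing this check in the identity authentication platform means the voice processing platform itself only ever sees already-authenticated requests.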
Step 202, obtaining the voice data to be processed specified by the user;
in practical application, after receiving a voice processing service request submitted by a user of another platform, the platform needs to acquire the to-be-processed voice data on which the user wants voice processing performed. The to-be-processed voice data is specified by the user and may be data carried in the received voice processing request or data from another source, which is not limited in the embodiment of the present invention.
Step 203, responding to the voice processing service request, and calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result;
step 204, sending the processing result to the user.
In an embodiment of the present invention, after receiving a voice processing service request submitted by a user of another service platform and acquiring the to-be-processed voice data specified by that user, the voice processing platform can provide the corresponding voice processing service: it responds to the request, invokes the voice processing service interface corresponding to the requested service to perform voice processing on the to-be-processed voice data, obtains the processing result, and sends the result to the requesting user.
In the embodiment of the present invention, the voice processing platform deployed in the developer portal system may receive voice processing service requests provided by users of multiple service platforms in the urban portal system, acquire to-be-processed voice data specified by the users who provide the requests, perform voice processing on the acquired to-be-processed voice data by calling a corresponding voice processing service interface, and return a processing result obtained by the processing to the users who provide the requests. In the urban portal system, the voice processing service requests sent by other service platforms are received by the voice processing platform deployed in the developer portal system, and the voice processing service interface is called for processing, so that users of other service platforms of the urban portal system can obtain processing results obtained after voice processing.
Referring to fig. 3, a flowchart illustrating steps of a second embodiment of a speech processing method according to the present invention is shown, and is applied to a speech processing platform, and specifically includes the following steps:
step 301, receiving a voice processing service request initiated by a user, and acquiring voice data to be processed from a client of the user;
in an embodiment of the present invention, a voice processing platform deployed in a developer portal system may receive a voice processing service request provided by a user, and acquire to-be-processed voice data specified by the user, so as to respond to the service request provided by the user and perform corresponding processing on the acquired to-be-processed voice data.
The user who makes the voice processing service request may be a login user of the developer portal system integrated in the city portal system, or a user of a service platform other than the developer portal system in the city portal system. As for the way of making a voice processing service request: after a user logs in to the developer portal system or another service platform, the main page of that system or platform may include an operation bar or a link entry for the voice processing service request, and the logged-in user may then request the voice processing service through a touch operation acting on the operation bar or the link entry.
In one embodiment of the present invention, step 301 may include the following sub-steps:
substep S11, establishing a real-time streaming media communication connection with the front-end system;
and a substep S12 of obtaining the voice data to be processed from the user' S front-end system through the real-time streaming media communication connection.
In practical applications, a user may make a voice processing service request on any of the multiple service platforms, and each of the multiple service platforms may include a front-end system. The voice processing platform may establish a Real-Time Streaming Protocol (RTSP) connection with the front-end system of a service platform, so that the voice data to be processed can be obtained from the user's front-end system through the established real-time streaming media communication connection.

Step 302, performing endpoint detection on the voice data to be processed to obtain effective voice data;
in an embodiment of the present invention, before responding to a voice processing service request provided by a user of another service platform and invoking a voice processing service interface to perform voice processing on voice data to be processed and obtaining a processing result, endpoint detection may be performed on the voice data to be processed to obtain valid voice data, so as to call the voice processing service interface to perform voice processing on the valid voice data to obtain the processing result.
In practical applications, endpoint detection of the voice data to be processed refers to performing endpoint detection on the input original PCM voice stream (that is, the voice data to be processed acquired from the user's front-end system through the real-time streaming media communication connection), and the voice data obtained after endpoint detection is the valid voice data of the effective voice part.
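By way of illustration only, the endpoint detection step described above can be sketched as a minimal energy-based detector over 16-bit PCM samples. The frame size and energy threshold below are invented for the sketch and are not values given in this document; a production system would use a proper VAD algorithm.

```python
# Minimal energy-based endpoint detection sketch over 16-bit PCM samples.
# Frame length (160 samples = 10 ms at 16 kHz) and threshold are illustrative.

def frame_energy(frame):
    """Mean absolute amplitude of a frame of 16-bit PCM samples."""
    return sum(abs(s) for s in frame) / max(len(frame), 1)

def detect_endpoints(samples, frame_len=160, threshold=500):
    """Return (start, end) sample indices spanning the active speech frames,
    or None if no frame exceeds the energy threshold."""
    start = end = None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        if frame_energy(samples[i:i + frame_len]) >= threshold:
            if start is None:
                start = i
            end = i + frame_len
    return None if start is None else (start, end)

# Example: silence, then a burst of "speech", then silence again.
pcm = [0] * 320 + [4000, -4000] * 160 + [0] * 320
region = detect_endpoints(pcm)  # the effective voice part
```

Only the samples inside the detected region would then be forwarded to the recognition service as the valid voice data.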
Step 303, calling a corresponding voice processing service interface to perform voice processing service on the voice data to be processed;
in a specific implementation, in response to the received voice processing service request, the voice processing service interface may be called through the gateway to perform voice processing on the voice data to be processed, where the voice processing service interface is a service interface registered and published in advance in the unified service gateway of the city portal system for providing the voice processing service.
Voice processing refers to audio processing of the voice data packets of an audio stream, and may support a REST API interface, deep semantic parsing, a user-defined recognition word stock, voice recognition, and the like. REST API support means that any other platform can communicate with the voice processing platform through HTTP requests to use the voice processing service; deep semantic parsing refers to semantic understanding across multiple domains, such as traffic, social, and entertainment; the user-defined recognition word stock refers to support for user-defined instruction sets and question-and-answer pairs, so as to understand the user's intention more accurately; and the supported voice recognition may include far-field speech recognition, near-field speech recognition, voice wake-up, and the like.
The received voice processing service request may include different service requests, the different service requests correspond to different voice processing service interfaces, and at this time, the voice processing service interface matched with the type of the service request may be called to process the voice data to be processed.
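The matching of a service request type to its voice processing service interface can be sketched as a simple dispatch table. The request-type keys and handler names below are illustrative assumptions for the sketch, not interfaces defined in this document.

```python
# Illustrative dispatch of a voice processing service request to the matching
# service interface. Handler names and request-type keys are assumptions.

def recognize_file(data):      return ("file_recognition", data)
def recognize_realtime(data):  return ("realtime_recognition", data)
def recognize_short(data):     return ("short_time_recognition", data)
def synthesize(data):          return ("speech_synthesis", data)

SERVICE_INTERFACES = {
    "recording_file": recognize_file,
    "realtime": recognize_realtime,
    "short_time": recognize_short,
    "tts": synthesize,
}

def handle_request(request_type, payload):
    """Call the voice processing service interface matched to the request type."""
    try:
        interface = SERVICE_INTERFACES[request_type]
    except KeyError:
        raise ValueError(f"no service interface registered for {request_type!r}")
    return interface(payload)
```

A real gateway would of course resolve registered interfaces rather than in-process functions, but the matching logic is the same.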
In one embodiment of the present invention, step 303 may include the following sub-steps:
the substep S21 is used for responding to the sound recording file identification service request and calling the sound recording file identification service interface to carry out identification processing on the sound recording file to obtain an identification result;
specifically, the voice processing service request may include a recording file identification service request, the voice processing service interface invoked in response to the recording file identification service request may be a recording file identification service interface, the service provided by the recording file identification service interface may be a recording file identification service for the recording file, and at this time, the recording file identification service may be used to perform identification processing on the recording file to obtain an identification result for the recording file.
The recording file identification service can be registered in advance in the unified service gateway of the city portal system and issued, a recording file identification service interface can be provided in a REST API mode, and a recording file needing identification processing can be placed on a certain server, which can be accessed through a URL. The REST API (i.e. the recording file identification service interface) for recording file identification may include two parts: a recording file identification service request interface in a POST manner and a recording file identification result query interface in a GET manner, where the request interface may be POST /stream/v1/files, and the query interface may be GET /stream/v1/files.
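As a sketch of the two-part REST flow just described, the following builds a POST request that submits a URL-accessible recording file for recognition and a GET request that polls for the result. The base URL, parameter names, and response shape are assumptions for illustration; only the /stream/v1/files paths come from the text.

```python
# Hedged sketch of the recording file identification REST flow:
# POST /stream/v1/files to submit a task, GET /stream/v1/files to query it.
import json
from urllib import request

BASE = "https://voice.example.com"  # hypothetical gateway address

def build_submit_request(file_url):
    """POST /stream/v1/files with the URL of the recording to recognize."""
    body = json.dumps({"file_url": file_url}).encode("utf-8")
    return request.Request(BASE + "/stream/v1/files", data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")

def build_query_request(task_id):
    """GET /stream/v1/files to poll the recognition result of a task."""
    return request.Request(f"{BASE}/stream/v1/files?task_id={task_id}",
                           method="GET")

# A real client would pass these to urllib.request.urlopen(...); that call is
# omitted here because the endpoint is hypothetical.
```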
The substep S22, responding to the real-time voice recognition service request, calling the real-time voice recognition service interface to perform real-time recognition processing on the real-time voice data, and obtaining a recognition result;
in another case, the voice processing service request may include a real-time voice recognition service request, the voice processing service interface invoked in response to the real-time voice recognition service request may be a real-time voice recognition service interface, the service provided by the real-time voice recognition service interface may be a real-time voice recognition service for real-time voice data, and at this time, the real-time voice recognition service may be used to perform real-time recognition processing on the real-time voice data to obtain a recognition result for the real-time voice data.
When real-time voice data is recognized in real time through the real-time voice recognition service, an authentication operation may be performed first, that is, token authentication may be performed when a WebSocket link is established between the client and the server corresponding to the voice processing platform. Request parameters may be set so that when the client initiates a request, the server can confirm that the initiated request is valid. The client may then send voice data to the server in a loop and continuously receive the recognition results sent by the server; the client may notify the server that the voice data has all been sent, and after the voice data is recognized, the server may send a recognition completion notice to the client.
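The exchange just described (token authentication on connect, looped audio chunks from the client, streamed partial results, and a completion notice) can be sketched with an in-memory stand-in for the WebSocket server. The message names and token value are illustrative assumptions.

```python
# Hedged sketch of the real-time recognition exchange. A real system would
# carry these messages over a WebSocket connection; here the server is a
# plain function over a message list so the protocol order is visible.

VALID_TOKEN = "demo-token"  # illustrative; a real token comes from authentication

def server_handle(messages):
    """Authenticate first, then emit one partial result per audio chunk and a
    completion notice once the client signals that all audio has been sent."""
    if not messages or messages[0] != ("auth", VALID_TOKEN):
        return [("error", "authentication failed")]
    replies = []
    for kind, payload in messages[1:]:
        if kind == "audio":
            replies.append(("partial_result", f"recognized:{payload}"))
        elif kind == "done":
            replies.append(("recognition_complete", None))
    return replies

# Client side: authenticate, stream chunks in a loop, then signal completion.
session = [("auth", VALID_TOKEN), ("audio", "chunk1"),
           ("audio", "chunk2"), ("done", None)]
replies = server_handle(session)
```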
Referring to fig. 4, a schematic flowchart of performing voice processing on voice data to be processed in the embodiment of the present invention is shown, where the voice data to be processed may be real-time voice data, the real-time voice data may include an audio stream with unlimited duration, and the voice processing operation performed may be real-time voice recognition, and the schematic flowchart may be applied to a scenario of performing real-time voice recognition on the audio stream with unlimited duration.
The voice processing service of the real-time voice recognition can comprise a preprocessing part, a core recognition part and a post-processing part.
Specifically, the preprocessing part may include two functions: speech decoding and speech endpoint detection. Speech decoding refers to decoding an opu-format voice stream, and speech endpoint detection refers to automatically detecting the front and back endpoints of the voice stream sent online by the user; in addition, the user may configure the parameters of the endpoint detection algorithm through some advanced parameters.
The core identification part may include two functions: core recognition and parameter control. Core recognition means that after endpoint detection is performed on the input original PCM voice stream, the detected effective voice part is sent in real time to the real-time voice recognition service for speech recognition, and the recognition result is returned to the user in real time; meanwhile, speech recognition of multiple languages and dialects is supported. Parameter control refers to the control of various parameters in the recognition process, such as the passing of hot words, generic hot words, customized models, and model parameters.
The post-processing part may re-process the obtained recognition result, including spoken-language smoothing (disfluency removal), punctuation addition, and ITN (Inverse Text Normalization).
In a preferred embodiment, during the process of real-time recognition processing of real-time voice data, a function of intelligent sentence-breaking can be provided, that is, a start and end time of each sentence can be provided. Specifically, intelligent sentence breaks can be expressed as automatic sentence breaks and speech sentence breaks.
Automatic sentence breaking may use the voice detection function to automatically detect the time point information of the start endpoint and the end endpoint of the voice stream sent online by the user, which facilitates subsequent speech recognition; the time point information of each sentence is related only to the amount of data sent by the user, and in addition, if no end endpoint is detected in the voice stream, the voice may be forcibly cut off after a default time of 60 s. Semantic sentence breaking, which may also be called streaming punctuation, addresses the fact that text produced by speech recognition differs from written language: (1) the recognized text may contain recognition errors; (2) it carries no punctuation; (3) a long sentence (for example, the text transcribed from 40-50 s of speech) has no punctuation at the source end; (4) a complete sentence may be cut into several short sentences by the VAD endpoint-detection-based algorithm due to factors such as the speaker's hesitation, emotion, and speaking style; and (5) the text is spoken in style and mixed with filler language. This causes two problems: punctuation based on incomplete sentence fragments may be inaccurate, affecting reading efficiency; and incomplete sentence output may severely impact subsequent tasks such as machine translation, summarization, and syntactic analysis. Because of these two problems, the obtained recognition result may be re-processed by the post-processing part, including spoken-language smoothing, punctuation addition, and ITN, with a sentence-breaking strategy added, so that the final recognition result conforms to the semantics.
The voice processing service of the real-time voice recognition comprises a preprocessing part, a core recognition part and a post-processing part, and can be used for scenes such as video real-time live subtitles, real-time conference recording, real-time court trial recording, intelligent voice assistance and the like.
Substep S23, responding to the speech synthesis service request, calling the speech synthesis service interface to perform speech synthesis processing on the text data, and obtaining a synthesized speech result;
in another case, the speech processing service request may include a speech synthesis service request, the speech processing service interface invoked in response to the speech synthesis service request may be a speech synthesis service interface, and the service provided by the speech synthesis service interface may be a speech synthesis service for text data, where the speech synthesis service may be used to perform speech synthesis processing on the text data to obtain a synthesized speech result for the text data.
The voice synthesis service can be registered in advance in the unified service gateway of the city portal system and issued, and a voice synthesis service interface can be provided in a REST API mode; the voice synthesis service interface may include a GET-method text upload interface, GET /stream/v1/tts, and a POST-method text upload interface, POST /stream/v1/tts. Specifically, the speech synthesis service interface may support requests of the two HTTP methods GET and POST, that is, the text to be synthesized may be uploaded to the server through the client corresponding to the voice processing platform, and the server may return the speech synthesis result of the text.
In a preferred embodiment, a speech synthesis service SDK may also be used to perform speech synthesis processing on the text data, and an authentication operation may be performed first, that is, when a WebSocket link is established between a client and a server corresponding to the speech processing platform, token may be used to perform authentication; request parameters can be set, so that when the client side initiates a request, the server side can confirm that the initiated request is valid; at this time, after the client uploads the text data to the server through two methods, namely POST and GET, the server can start to return synthesized voice binary data, at this time, the voice synthesis service SDK can receive and process the returned binary data, the client can notify the server that the voice data transmission is completed, and the server can return a final voice synthesis result to the client.
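The GET and POST upload paths described above can be sketched as follows. The base URL and parameter names are assumptions, and the /stream/v1/tts path is reconstructed on the pattern of the other service interfaces in this document, so treat every detail as illustrative.

```python
# Hedged sketch of uploading text for synthesis via GET and POST. The server
# would answer with binary audio data; no network call is made here.
import json
from urllib import parse, request

BASE = "https://voice.example.com"  # hypothetical gateway address

def build_tts_get(text):
    """GET /stream/v1/tts with the text passed as a query parameter."""
    qs = parse.urlencode({"text": text})
    return request.Request(f"{BASE}/stream/v1/tts?{qs}", method="GET")

def build_tts_post(text):
    """POST /stream/v1/tts with the text in a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return request.Request(BASE + "/stream/v1/tts", data=body,
                           headers={"Content-Type": "application/json"},
                           method="POST")
```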
And a substep S24, responding to the short-time speech recognition service request, and calling the short-time speech recognition service interface to perform recognition processing on the short-time speech data to obtain a recognition result.
In another case, the voice processing service request may include a short-time voice recognition service request, the voice processing service interface invoked in response to the short-time voice recognition service request may be a short-time voice recognition service interface, and the service provided by the short-time voice recognition service interface may be a short-time voice recognition service for short-time voice data, where the short-time voice recognition service may be used to perform recognition processing on the short-time voice data to obtain a recognition result for the short-time voice data.
The short-time voice recognition service can be registered in advance in the unified service gateway of the city portal system and issued, a short-time voice recognition service interface can be provided in a REST API mode, and the short-time voice recognition service interface may be POST /stream/v1/asr.
The short-time speech recognition service can support whole-segment uploading of a voice file of less than one minute, and the recognition result can be returned at one time in the request response in JSON format; therefore, during short-time voice recognition of the short-time voice data, the connection between the voice processing platform and the service platform where the user who initiated the voice processing service request is located needs to be kept alive until the recognition result is returned. Specifically, the client corresponding to the voice processing platform may send an HTTP REST POST request carrying the audio data to the server, and the server may return an HTTP response carrying the recognition result; after the client sends the HTTP request that uploads the audio to the server, it receives the server's response, in which the recognition result is stored in the form of a JSON character string.
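The single-request, single-response shape of the short-time recognition call can be sketched like this. The base URL, content type, and JSON field name are assumptions for illustration; only the /stream/v1/asr path is from the text.

```python
# Hedged sketch of the short-time recognition call: one POST carrying the
# whole (under one minute) audio payload, one response carrying the result
# as a JSON character string.
import json
from urllib import request

BASE = "https://voice.example.com"  # hypothetical gateway address

def build_asr_request(audio_bytes):
    """POST /stream/v1/asr with the entire audio payload in the body."""
    return request.Request(BASE + "/stream/v1/asr", data=audio_bytes,
                           headers={"Content-Type": "application/octet-stream"},
                           method="POST")

def parse_asr_response(body):
    """The recognition result arrives once, stored as a JSON string."""
    return json.loads(body)["result"]
```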
Referring to fig. 5, a schematic processing flow diagram of a speech recognition service in an embodiment of the present invention is shown, where the speech data to be processed may be short-time speech data, the short-time speech data may include speech with a short duration (within one minute), and the speech processing operation performed may be short-time speech recognition, and the schematic processing flow diagram may be applied to a scenario in which short-time speech recognition is performed on speech with a short duration.
As shown in fig. 5, short-time speech recognition of short-time speech data may be realized either through a speech recognition SDK or by calling the short-time speech recognition service interface through the gateway. The short-time recognition processing operation provided by the short-time speech recognition service interface may be obtained by customizing a model for short-time speech recognition through a self-learning platform and training the model with generic hot words, similar hot words, and the like. It should be noted that this flow is applicable to shorter voice interaction scenarios, such as voice search, voice instructions, and voice short messages, and can be integrated into various products such as apps, intelligent home appliances, and intelligent assistants.
The speech SDK here may refer to a speech recognition SDK service. An authentication operation may be performed first, that is, token authentication may be performed when a WebSocket link is established between the client and the server corresponding to the voice processing platform; request parameters may be set so that when the client initiates a request, the server can confirm that the initiated request is valid. The client may then send voice data to the server in a loop and continuously receive the recognition results sent by the server; the client may notify the server that the voice data has all been sent, and after the voice data is recognized, the server may send a recognition completion notice to the client.
In a preferred embodiment, when performing voice processing on the voice data to be processed, a preset voice processing model for the user may be obtained, and in response to the voice processing service request, the voice processing service interface may be called to perform voice processing on the voice data to be processed according to the preset voice processing model, so as to obtain a processing result. The preset voice processing model may be obtained by training with a word stock uploaded by the user.
Step 304, optimizing the processing result; the optimization processing includes: spoken-language smoothing, punctuation addition, and inverse text normalization (ITN);
in an embodiment of the present invention, before the processing result, obtained by calling the voice processing service interface in response to the voice processing service request to perform voice processing on the voice data to be processed, is sent to the user of the other service platform, optimization processing may be performed on the processing result, so that the optimized processing result is sent to the user.
In practical applications, the optimization processing performed on the processing result of voice processing may include spoken-language smoothing, punctuation addition, and ITN. Spoken-language smoothing mainly addresses disfluencies in speech, currently chiefly by filtering out filler words and the like so that the recognized text reads more fluently. Punctuation addition means adding punctuation to the recognized text with an automatic punctuation module; automatic punctuation is a sequence labeling task, which may be modeled with a statistical model or a neural network model to label the input recognition result. ITN processing refers to the fact that in most speech recognition systems, the core speech recognizer produces a token sequence in spoken form, which is then converted to written form by the ITN process; ITN objects may include numbers, dates, addresses, and the like.
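The three post-processing steps just listed can be illustrated with toy English rules; in practice each step would be a statistical or neural model, so everything below (filler list, number table, punctuation rule) is an invented stand-in.

```python
# Toy illustration of the optimization pipeline: spoken-language smoothing,
# inverse text normalization (ITN), and punctuation addition. All rules here
# are illustrative stand-ins for real models.

FILLERS = {"um", "uh", "like"}
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def smooth(tokens):
    """Spoken-language smoothing: drop filler words."""
    return [t for t in tokens if t.lower() not in FILLERS]

def itn(tokens):
    """ITN: convert spoken-form number tokens to written form."""
    return [NUMBER_WORDS.get(t.lower(), t) for t in tokens]

def punctuate(tokens):
    """Toy punctuation: capitalize the first token and end with a period."""
    if not tokens:
        return ""
    text = " ".join(tokens)
    return text[0].upper() + text[1:] + "."

raw = ["um", "meet", "me", "at", "five"]
final = punctuate(itn(smooth(raw)))
```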
Step 305, sending the processing result after the voice processing to the user.
In an embodiment of the present invention, after the voice processing platform receives the voice processing service request provided by the user of another service platform, obtains the to-be-processed voice data specified by the user, and invokes the voice processing service interface corresponding to the requested voice processing service to perform voice processing on the voice data, it can send the processing result to the user, so that the voice processing service requested by users of other service platforms in the city portal system is provided by the voice processing platform deployed in the developer portal system.
In the embodiment of the present invention, the voice processing platform deployed in the developer portal system may receive voice processing service requests provided by users of multiple service platforms in the urban portal system, acquire to-be-processed voice data specified by the users who provide the requests, perform voice processing on the acquired to-be-processed voice data by calling a corresponding voice processing service interface, and return a processing result obtained by the processing to the users who provide the requests. In the urban portal system, the voice processing service requests sent by other service platforms are received by the voice processing platform deployed in the developer portal system, and the voice processing service interface is called for processing, so that users of other service platforms of the urban portal system can obtain processing results obtained after voice processing.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 6, a block diagram of a voice processing apparatus according to an embodiment of the present invention is shown, and is applied to a voice processing platform, where the voice processing platform is deployed in a developer portal system, the developer portal system is integrated in a city portal system, a voice processing service interface provided by the voice processing platform is registered in advance in a unified service gateway of the city portal system and is published, and the city portal system further includes a plurality of service platforms, and may specifically include the following modules:
a service request receiving module 601, configured to receive a voice processing service request provided by a user of the multiple service platforms;
a to-be-processed voice data obtaining module 602, configured to obtain to-be-processed voice data specified by the user;
the voice processing module 603 is configured to respond to the voice processing service request, and invoke the voice processing service interface to perform voice processing on the to-be-processed voice data to obtain a processing result;
a processing result sending module 604, configured to send the processing result to the user.
In an embodiment of the present invention, the to-be-processed voice data obtaining module 602 may include the following sub-modules:
the communication connection establishing submodule is used for establishing real-time streaming media communication connection with the front-end system;
and the voice data to be processed acquisition submodule is used for acquiring the voice data to be processed from the front-end system of the user through the real-time streaming media communication connection.
In an embodiment of the present invention, before the responding to the voice processing service request and invoking the voice processing service interface to perform voice processing on the to-be-processed voice data to obtain a processing result, the following module may be further included:
the effective voice data acquisition module is used for carrying out endpoint detection on the voice data to be processed to obtain effective voice data;
the speech processing module 603 may include the following sub-modules:
and the voice processing submodule is used for responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the effective voice data to obtain a processing result.
In an embodiment of the present invention, before sending the processing result to the user, the following module may be further included:
the optimization processing module is used for optimizing the processing result; the optimization processing includes: spoken-language smoothing, punctuation addition, and inverse text normalization (ITN);
the processing result sending module 604 may include the following sub-modules:
and the processing result sending submodule is used for sending the processing result after the optimization processing to the user.
In one embodiment of the present invention, the speech processing module 603 may include the following sub-modules:
the voice processing model obtaining sub-module is used for obtaining a preset voice processing model aiming at the user;
and the voice processing model processing submodule is used for responding to the voice processing service request, calling the voice processing service interface to perform voice processing on the voice data to be processed according to the preset voice processing model, and obtaining a processing result.
In an embodiment of the present invention, the speech processing module 603 may further include the following sub-modules:
the word stock acquisition sub-module is used for acquiring the word stock uploaded by the user;
and the voice processing model training submodule is used for training a preset voice processing model by adopting the word bank uploaded by the user.
In an embodiment of the present invention, the voice processing service interface includes a recording file identification service interface, the voice data to be processed includes a recording file, and the voice processing service request includes a recording file identification service request; the speech processing module 603 may include the following sub-modules:
and the first voice processing submodule is used for responding to the recording file identification service request, calling the recording file identification service interface to identify the recording file and obtaining an identification result.
In an embodiment of the present invention, the voice processing service interface includes a real-time voice recognition service interface, the voice data to be processed includes real-time voice data, and the voice processing service request includes a real-time voice recognition service request; the speech processing module 603 may include the following sub-modules:
and the second voice processing submodule is used for responding to the real-time voice recognition service request and calling the real-time voice recognition service interface to perform real-time recognition processing on the real-time voice data to obtain a recognition result.
In an embodiment of the present invention, the voice processing service interface includes a short-time voice recognition service interface, the voice data to be processed includes short-time voice data, and the voice processing service request includes a short-time voice recognition service request; the speech processing module 603 may include the following sub-modules:
and the third voice processing submodule is used for responding to the short-time voice recognition service request and calling the short-time voice recognition service interface to perform recognition processing on the short-time voice data to obtain a recognition result.
In an embodiment of the present invention, the voice processing service interface includes a voice synthesis service interface, the voice data to be processed includes text data, and the voice processing service request includes a voice synthesis service request; the speech processing module 603 may include the following sub-modules:
and the fourth voice processing submodule is used for responding to the voice synthesis service request and calling the voice synthesis service interface to carry out voice synthesis processing on the text data to obtain a synthesized voice result.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and when being executed by the processor, the computer program implements each process of the embodiment of the speech processing method, and can achieve the same technical effect, and is not described herein again to avoid repetition.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the foregoing speech processing method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," and any variations thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The foregoing has described a speech processing method and a speech processing apparatus in detail. Specific examples have been used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific implementations and to the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A voice processing method, applied to a voice processing platform, wherein the voice processing platform is deployed in a developer portal system, the developer portal system is integrated into a city portal system, a voice processing service interface provided by the voice processing platform is registered and published in advance in a unified service gateway of the city portal system, and the city portal system further comprises a plurality of service platforms, the method comprising:
receiving voice processing service requests submitted by users of the plurality of service platforms;
acquiring the voice data to be processed specified by the user;
in response to the voice processing service request, calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result; and
sending the processing result to the user.
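The four steps of claim 1 — receive the request, acquire the user-specified audio, dispatch to a published service interface, and return the result — can be sketched as a minimal request handler. All class, method, and key names below are illustrative assumptions, not from the patent:

```python
# Minimal sketch of the claim-1 flow. The platform holds a registry of
# service interfaces (standing in for those published at the unified
# service gateway) and dispatches each request to the matching one.

class VoiceProcessingPlatform:
    def __init__(self):
        self._interfaces = {}  # service_type -> handler

    def register_interface(self, service_type, handler):
        """Register a voice processing service interface (illustrative)."""
        self._interfaces[service_type] = handler

    def handle_request(self, request, fetch_audio):
        # Step 1: the request has been received; step 2: acquire the
        # user-specified voice data to be processed.
        audio = fetch_audio(request["data_id"])
        # Step 3: respond to the request by calling the matching interface.
        handler = self._interfaces[request["service_type"]]
        result = handler(audio)
        # Step 4: the result is sent back to the requesting user.
        return result

platform = VoiceProcessingPlatform()
platform.register_interface("short_asr", lambda audio: f"recognized:{audio}")
result = platform.handle_request(
    {"service_type": "short_asr", "data_id": "clip-1"},
    fetch_audio=lambda data_id: f"audio<{data_id}>",
)
print(result)  # recognized:audio<clip-1>
```

In a real deployment the registry lookup would go through the unified service gateway rather than an in-process dictionary; the dispatch shape is the same.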
2. The method of claim 1, wherein the plurality of service platforms comprises a front-end system, and wherein the acquiring the voice data to be processed specified by the user comprises:
establishing a real-time streaming media communication connection with the front-end system; and
acquiring the voice data to be processed from the front-end system of the user through the real-time streaming media communication connection.
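Acquiring audio over the real-time streaming connection of claim 2 can be modeled as consuming an iterator of byte chunks. The transport is left abstract here (a real front-end might stream over WebSocket, RTMP, or WebRTC); `read_stream` and its parameters are hypothetical names:

```python
# Sketch of claim 2: collect to-be-processed audio from an established
# real-time streaming connection, modeled as an iterable of byte chunks.

def read_stream(connection, chunk_limit=None):
    """Accumulate audio chunks from a streaming connection into one buffer."""
    buffer = bytearray()
    for i, chunk in enumerate(connection):
        if chunk_limit is not None and i >= chunk_limit:
            break  # stop early, e.g. per-request quota
        buffer.extend(chunk)
    return bytes(buffer)

# Simulated front-end stream of three chunks
stream = iter([b"\x01\x02", b"\x03", b"\x04\x05"])
audio = read_stream(stream)
print(len(audio))  # 5
```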
3. The method according to claim 1, wherein before the responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result, the method further comprises:
performing endpoint detection on the voice data to be processed to obtain effective voice data;
and the responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result comprises:
in response to the voice processing service request, calling the voice processing service interface to perform voice processing on the effective voice data to obtain a processing result.
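The endpoint detection of claim 3 can be illustrated with a toy energy-threshold detector: frames whose mean squared amplitude exceeds a threshold are treated as effective speech. Real systems use statistical or neural voice activity detection; the function name and threshold below are assumptions for illustration only:

```python
# Toy endpoint detection (VAD): find the first and last frame whose
# energy exceeds a fixed threshold, delimiting the effective speech.

def detect_endpoints(frames, threshold=0.01):
    """Return (start, end) frame indices of the voiced region, or None."""
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return None  # no effective speech found
    return voiced[0], voiced[-1]

frames = [
    [0.0, 0.001],   # silence
    [0.5, -0.4],    # speech
    [0.3, 0.2],     # speech
    [0.002, 0.0],   # silence
]
print(detect_endpoints(frames))  # (1, 2)
```

Trimming to this `(start, end)` range before recognition is what reduces the data handed to the service interface to "effective voice data".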
4. The method of claim 1, further comprising, before the sending the processing result to the user:
performing optimization processing on the processing result, wherein the optimization processing comprises: spoken-language smoothing, punctuation addition, and inverse text normalization (ITN);
and the sending the processing result to the user comprises:
sending the optimized processing result to the user.
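The three optimization passes of claim 4 can be illustrated on a transcript string. The filler list, number table, and punctuation rule below are placeholder assumptions, not the patent's actual models:

```python
# Toy versions of the claim-4 passes: spoken-language smoothing
# (drop fillers), inverse text normalization (spelled-out numbers to
# digits), and naive punctuation addition.

FILLERS = {"um", "uh", "er"}
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "ten": "10"}

def optimize_transcript(text):
    # 1) spoken-language smoothing: remove simple fillers
    words = [w for w in text.split() if w not in FILLERS]
    # 2) ITN: rewrite spelled-out numbers as digits
    words = [NUMBER_WORDS.get(w, w) for w in words]
    # 3) punctuation addition: capitalize and terminate the utterance
    return " ".join(words).capitalize() + "."

print(optimize_transcript("um send ten files"))  # Send 10 files.
```

Production ITN and punctuation restoration are typically rule-transducer or model based; this sketch only shows where each pass sits in the pipeline.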
5. The method of claim 1, wherein the responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result comprises:
acquiring a voice processing model preset for the user; and
in response to the voice processing service request, calling the voice processing service interface to perform voice processing on the voice data to be processed according to the preset voice processing model to obtain a processing result.
6. The method of claim 5, further comprising:
acquiring a word bank uploaded by the user; and
training the preset voice processing model with the word bank uploaded by the user.
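Claims 5 and 6 describe per-user model customization from an uploaded word bank. One lightweight stand-in for such "training" is hotword rescoring: bias hypothesis selection toward the user's lexicon. The class and method names are hypothetical, and a real system would fine-tune a language model rather than count matches:

```python
# Sketch of claims 5-6: an uploaded word bank biases recognition.
# "Training" is reduced here to building a hotword set used to rescore
# candidate hypotheses.

class CustomizableRecognizer:
    def __init__(self):
        self.hotwords = set()

    def train_with_lexicon(self, words):
        """Absorb the user-uploaded word bank (illustrative stand-in)."""
        self.hotwords.update(w.lower() for w in words)

    def rescore(self, hypotheses):
        """Pick the hypothesis containing the most user-lexicon terms."""
        def score(h):
            return sum(1 for w in h.lower().split() if w in self.hotwords)
        return max(hypotheses, key=score)

rec = CustomizableRecognizer()
rec.train_with_lexicon(["Hengqin", "gateway"])
best = rec.rescore(["hang chin get way", "Hengqin gateway up"])
print(best)  # Hengqin gateway up
```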
7. The method according to claim 1, wherein the voice processing service interface comprises a recording file recognition service interface, and/or a real-time voice recognition service interface, and/or a short-time voice recognition service interface, and/or a voice synthesis service interface; the voice data to be processed comprises a recording file, and/or real-time voice data, and/or short-time voice data, and/or text data; and the voice processing service request comprises a recording file recognition service request, and/or a real-time voice recognition service request, and/or a short-time voice recognition service request, and/or a voice synthesis service request;
and the responding to the voice processing service request and calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result comprises:
in response to the recording file recognition service request, calling the recording file recognition service interface to recognize the recording file to obtain a recognition result;
and/or, in response to the real-time voice recognition service request, calling the real-time voice recognition service interface to recognize the real-time voice data in real time to obtain a recognition result;
and/or, in response to the short-time voice recognition service request, calling the short-time voice recognition service interface to recognize the short-time voice data to obtain a recognition result;
and/or, in response to the voice synthesis service request, calling the voice synthesis service interface to perform voice synthesis on the text data to obtain a synthesized voice result.
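The four-way branching of claim 7 — recording-file, real-time, and short-time recognition, plus synthesis — amounts to a dispatch table keyed by request type. The handler and key names below are illustrative:

```python
# Claim-7 branching as a dispatch table: each service request type maps
# to its own service interface handler (stubbed out here).

def recognize_file(data):   return ("transcript", data)
def recognize_stream(data): return ("live-transcript", data)
def recognize_short(data):  return ("short-transcript", data)
def synthesize(data):       return ("audio", data)

DISPATCH = {
    "file_asr": recognize_file,
    "realtime_asr": recognize_stream,
    "short_asr": recognize_short,
    "tts": synthesize,
}

def process(request_type, payload):
    """Route a voice processing service request to the matching interface."""
    return DISPATCH[request_type](payload)

print(process("tts", "hello"))  # ('audio', 'hello')
```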
8. A voice processing apparatus, applied to a voice processing platform, wherein the voice processing platform is deployed in a developer portal system, the developer portal system is integrated into a city portal system, a voice processing service interface provided by the voice processing platform is registered and published in advance in a unified service gateway of the city portal system, and the city portal system further comprises a plurality of service platforms, the apparatus comprising:
a service request receiving module, configured to receive voice processing service requests submitted by users of the plurality of service platforms;
a to-be-processed voice data acquisition module, configured to acquire the voice data to be processed specified by the user;
a voice processing module, configured to respond to the voice processing service request by calling the voice processing service interface to perform voice processing on the voice data to be processed to obtain a processing result; and
a processing result sending module, configured to send the processing result to the user.
9. An electronic device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the voice processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the voice processing method according to any one of claims 1 to 7.
CN202011039618.0A 2020-09-28 2020-09-28 Voice processing method and device Pending CN112289314A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011039618.0A CN112289314A (en) 2020-09-28 2020-09-28 Voice processing method and device


Publications (1)

Publication Number Publication Date
CN112289314A 2021-01-29

Family

ID=74422664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011039618.0A Pending CN112289314A (en) 2020-09-28 2020-09-28 Voice processing method and device

Country Status (1)

Country Link
CN (1) CN112289314A (en)


Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1051992A (en) * 1990-11-30 1991-06-05 清华大学 Linguistic interface plate and Ethernet speech real time communication system
US20030007609A1 (en) * 2001-07-03 2003-01-09 Yuen Michael S. Method and apparatus for development, deployment, and maintenance of a voice software application for distribution to one or more consumers
CN2613952Y (en) * 2003-02-12 2004-04-28 厦门市恒信网元通信技术有限公司 Apparatus for collecting real communication process and state data of any telephone number
CN1750499A (en) * 2004-09-17 2006-03-22 北京优能城际信息技术有限公司 Voice browse system
CN1909685A (en) * 2006-07-18 2007-02-07 华为技术有限公司 Method and system for realizing separation of service platform and gate
CN101335764A (en) * 2008-07-15 2008-12-31 上海华为技术有限公司 Multi-frame synthesizing and parsing method and apparatus, multi-frame processing system
CN101770774A (en) * 2009-12-31 2010-07-07 吉林大学 Embedded-based open set speaker recognition method and system thereof
CN102103856A (en) * 2009-12-21 2011-06-22 盛大计算机(上海)有限公司 Voice synthesis method and system
CN102201233A (en) * 2011-05-20 2011-09-28 北京捷通华声语音技术有限公司 Mixed and matched speech synthesis method and system thereof
CN102521693A (en) * 2011-12-05 2012-06-27 陕西东显永益机电技术有限公司 Administrative information mobile approval management system
US20140258324A1 (en) * 2013-03-06 2014-09-11 Nuance Communications, Inc. Task assistant utilizing context for improved interaction
CN105929967A (en) * 2016-05-20 2016-09-07 中国电子科技集团公司第十研究所 Simulation system for processing of multiple paths of real-time audio signals
CN107992587A (en) * 2017-12-08 2018-05-04 北京百度网讯科技有限公司 A kind of voice interactive method of browser, device, terminal and storage medium
CN109243458A (en) * 2018-11-22 2019-01-18 苏州米机器人有限公司 A kind of speech recognition system for intelligent robot
CN109410932A (en) * 2018-10-17 2019-03-01 百度在线网络技术(北京)有限公司 Voice operating method and apparatus based on HTML5 webpage
CN111447331A (en) * 2020-03-26 2020-07-24 中国建设银行股份有限公司 Voice portal session management system and method
CN111489748A (en) * 2019-10-18 2020-08-04 广西电网有限责任公司 Intelligent voice scheduling auxiliary system


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116431316A (en) * 2023-06-06 2023-07-14 阿里巴巴(中国)有限公司 Task processing method, system, platform and automatic question-answering method
CN116431316B (en) * 2023-06-06 2023-11-14 阿里巴巴(中国)有限公司 Task processing method, system, platform and automatic question-answering method

Similar Documents

Publication Publication Date Title
US11682382B2 (en) Voice-activated selective memory for voice-capturing devices
KR102100976B1 (en) Digital assistant processing with stack data structure background
US9177551B2 (en) System and method of providing speech processing in user interface
CN109413286A (en) A kind of intelligent customer service voice response system and method
US20090055186A1 (en) Method to voice id tag content to ease reading for visually impaired
CN110473549A (en) A kind of voice dialogue analysis system, method and storage medium
EP3596615A1 (en) Adaptive interface in a voice-activated network
CN109360565A (en) A method of precision of identifying speech is improved by establishing resources bank
CN115136124A (en) System and method for establishing an interactive communication session
CN108364638A (en) A kind of voice data processing method, device, electronic equipment and storage medium
CN111339282A (en) Intelligent online response method and intelligent customer service system
CN112289314A (en) Voice processing method and device
CN111858874A (en) Conversation service processing method, device, equipment and computer readable storage medium
CN110047473A (en) A kind of man-machine collaboration exchange method and system
CN114462376A (en) RPA and AI-based court trial record generation method, device, equipment and medium
CN112287104A (en) Natural language processing method and device
CN113782022B (en) Communication method, device, equipment and storage medium based on intention recognition model
CN111048084B (en) Method and system for pushing information in intelligent voice interaction process
US11967307B2 (en) Voice communication analysis system
US20220262348A1 (en) Voice communication analysis system
CN108630201B (en) Method and device for establishing equipment association
Suciu et al. Cloud Computing Customer Communication Center
CN117424960A (en) Intelligent voice service method, device, terminal equipment and storage medium
CN116414970A (en) Data processing method, apparatus, program product, computer device, and medium
CN116628264A (en) Conference information processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210129