CN107657017B - Method and apparatus for providing voice service

Publication number: CN107657017B
Application number: CN201710882420.0A
Authority: CN (China)
Other versions: CN107657017A (application publication)
Original language: Chinese (zh)
Inventor: 谢波
Applicant/Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Prior art keywords: information, text, tone, voice, response
Legal status: Active (granted)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/33 — Querying
    • G06F16/3331 — Query processing
    • G06F16/3334 — Selection or weighting of terms from queries, including natural language queries
    • G06F16/332 — Query formulation
    • G06F16/3329 — Natural language query formulation or dialogue systems
    • G06F16/334 — Query execution
    • G06F16/3343 — Query execution using phonetics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods and apparatus for providing voice services are disclosed. One embodiment of the method for providing voice services comprises: acquiring a voice input signal; recognizing the tone and the speaking content in the voice input signal by utilizing a semantic recognition model trained by a machine learning method to obtain corresponding tone input information and text input information, wherein the tone input information is used for expressing the tone type of the voice input signal; and carrying out a voice service data query based on the tone input information and the text input information, and generating voice response information according to the query result. This embodiment realizes tone recognition that does not depend on tone auxiliary words, can more accurately detect the intention of a speaker, and improves the accuracy of the voice service.

Description

Method and apparatus for providing voice service
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for providing a voice service.
Background
Artificial Intelligence (AI) is a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Among these, speech recognition technology is an important direction in the fields of computer science and artificial intelligence.
The tone with which a speaker speaks usually carries information such as the speaker's emotion and needs. Existing speech recognition technology mainly recognizes the speaker's tone by means of tone auxiliary words (sentence-final modal particles), and then judges the speaker's requirement. However, this kind of tone identification method has strong limitations. On the one hand, the same tone auxiliary word may correspond to different tones; for example, the same sentence-final particle (as in "really?") may represent either exclamation or question. On the other hand, for speech that does not include tone auxiliary words, the tone of the speaker cannot be accurately recognized, and consequently the emotion and intention of the speaker cannot be accurately judged.
Disclosure of Invention
To address one or more of the technical problems noted in the background section above, embodiments of the present application provide a method and apparatus for providing voice services.
In a first aspect, an embodiment of the present application provides a method for providing a voice service, including: acquiring a voice input signal; recognizing the tone and the speaking content in the voice input signal by utilizing a semantic recognition model trained by adopting a machine learning method to obtain corresponding tone input information and text input information, wherein the tone input information is used for expressing the tone type of the voice input signal; and carrying out voice service data query based on the tone input information and the text input information, and generating voice response information according to a query result.
In some embodiments, the performing a voice service data query based on the mood input information and the text input information and generating voice response information according to a query result includes: determining user demand information based on the mood input information and the text input information; inquiring voice service data matched with the user demand information to generate text response information; and converting the text response information into voice response information.
In some embodiments, the performing voice service data query based on the mood input information and the text input information and generating voice response information according to a query result further includes: inquiring tone output information corresponding to the tone input information based on the acquired corresponding relation between the preset tone input information and the preset tone output information, wherein the tone output information is used for identifying the tone of the voice response information to be generated; the converting the text response information into the voice response information includes: and performing text-to-speech conversion on the text response information by combining the tone output information to generate speech response information containing tones.
In some embodiments, the above method further comprises: acquiring a sample conversation set, wherein the sample conversation set comprises a plurality of sections of sample conversations, and the sample conversations comprise audio data of a request text and audio data of a corresponding response text; determining tone information of the corresponding request text according to the audio data of the response text; and taking the audio data of the request text, the request text and the mood information of the request text as training samples, and training the semantic recognition model by adopting a machine learning method.
In some embodiments, the method further comprises the step of constructing a sample dialog set, comprising: collecting dialogue linguistic data containing audio data of a preset request text; extracting the audio data of the response text corresponding to each preset request text from each dialogue corpus; and combining the audio data of each preset request text and the audio data of the corresponding response text to generate a plurality of sections of sample conversations so as to form a sample conversation set.
In a second aspect, an embodiment of the present application provides an apparatus for providing a voice service, including: an acquisition unit for acquiring a voice input signal; the recognition unit is used for recognizing the tone and the speaking content in the voice input signal by utilizing a semantic recognition model trained by adopting a machine learning method to obtain corresponding tone input information and text input information, wherein the tone input information is used for expressing the tone type of the voice input signal; and the response unit is used for carrying out voice service data query based on the tone input information and the text input information and generating voice response information according to a query result.
In some embodiments, the response unit is further configured to generate the voice response information as follows: determining user demand information based on the mood input information and the text input information; inquiring voice service data matched with the user demand information to generate text response information; and converting the text response information into voice response information.
In some embodiments, the response unit is further configured to: inquiring tone output information corresponding to the tone input information based on the acquired corresponding relation between the preset tone input information and the preset tone output information, wherein the tone output information is used for identifying the tone of the voice response information to be generated; and the response unit is further used for converting the text response information into the voice response information as follows: and performing text-to-speech conversion on the text response information by combining the tone output information to generate speech response information containing tones.
In some embodiments, the above apparatus further comprises: the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a sample conversation set, the sample conversation set comprises a plurality of sections of sample conversations, and the sample conversations comprise audio data of a request text and audio data of a corresponding response text; the determining unit is used for determining tone information of the corresponding request text according to the audio data of the response text; and the training unit is used for taking the audio data of the request text, the request text and the mood information of the request text as training samples and training the semantic recognition model by adopting a machine learning method.
In some embodiments, the apparatus further includes a construction unit for constructing the sample dialog set, the construction unit constructing the sample dialog set as follows: collecting dialogue linguistic data containing audio data of a preset request text; extracting the audio data of the response text corresponding to each preset request text from each dialogue corpus; and combining the audio data of each preset request text and the audio data of the corresponding response text to generate a plurality of sections of sample conversations so as to form a sample conversation set.
According to the method and the device for providing the voice service, the voice input signal is obtained, and the tone and the speaking content in the voice input signal are then identified by utilizing the semantic identification model trained by a machine learning method, so that corresponding tone input information and text input information are obtained, wherein the tone input information is used for representing the tone type of the voice input signal; voice service data query is then carried out based on the tone input information and the text input information, and voice response information is generated according to the query result, so that tone identification independent of tone auxiliary words is realized, the intention of a speaker can be detected more accurately, and the accuracy of the voice service is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a schematic flow chart diagram illustrating one embodiment of a method for providing voice services in accordance with the present application;
FIG. 3 is a schematic diagram of an application scenario according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating another embodiment of a method for providing voice services in accordance with the present application;
FIG. 5 is a block diagram illustrating an embodiment of an apparatus for providing voice services according to the present application;
FIG. 6 is a schematic block diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method for providing voice services or the apparatus for providing voice services of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, a network 103, and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user 110 may use the terminal devices 101, 102 to interact with the server 104 via the network 103 to receive or send messages or the like. Various voice interaction applications may be installed on the terminal devices 101, 102.
The terminal devices 101, 102 may be various electronic devices having an audio input interface and an audio output interface and supporting internet access, including but not limited to smartphones, tablet computers, smartwatches, e-book readers, smart speakers, and the like.
The server 104 may be a voice server for providing support for voice services, and the voice server may receive and parse the voice interaction request sent by the terminal device 101, 102, then search for corresponding service data, generate response data, and return the generated response data to the terminal device 101, 102.
It should be noted that the method for providing the voice service provided by the embodiment of the present application may be executed by the server 104, and accordingly, the apparatus for providing the voice service may be disposed in the server 104.
It should be understood that the number of terminal devices, networks, servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for providing voice services in accordance with the present application is shown. The method for providing a voice service includes the steps of:
step 201, acquiring a voice input signal.
In the present embodiment, an electronic device (e.g., a server shown in fig. 1) on which the above-described method for providing a voice service operates may acquire a voice input signal generated according to voice information uttered by a user through a network. Specifically, the electronic device may establish a connection with a terminal device having an audio input interface (e.g., the terminal device shown in fig. 1) through a network, and the terminal device may acquire voice information uttered by a user through the audio input interface, encode the voice information to generate a voice input signal, and transmit the voice input signal to the electronic device on which the method for providing a voice service operates through the network.
Generally, a terminal device having an audio input device (e.g., a microphone) may have a voice interaction application installed thereon, and a user may wake up a voice assistant by a gesture, a specific key, or a specific audio signal, and then the terminal device may detect a sound made by the user and encode the sound according to the detected sound to generate a voice input signal. Thereafter, in order to acquire response data to the voice input signal, the terminal device may request a connection with the voice server and transmit the voice input signal to the voice server. The voice server may receive the voice input signal generated by the terminal device through the network.
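As an illustration of the terminal-to-server exchange described above, the following Python sketch shows one way a terminal might upload the encoded voice input signal to the voice server and receive the voice response; the endpoint URL, payload format and function name are assumptions made for illustration, not part of the original disclosure.

```python
# Hypothetical sketch of the terminal-to-voice-server exchange; the endpoint
# URL and payload format are assumptions, not the patent's actual protocol.
import requests

def send_voice_request(audio_bytes: bytes,
                       server_url: str = "http://voice-server.example/api/voice") -> bytes:
    """Upload the encoded voice input signal and return the encoded voice response."""
    resp = requests.post(
        server_url,
        data=audio_bytes,
        headers={"Content-Type": "application/octet-stream"},
    )
    resp.raise_for_status()
    return resp.content  # encoded voice response information from the server
```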
Step 202, recognizing the tone and the speaking content in the speech input signal by using the semantic recognition model trained by the machine learning method to obtain corresponding tone input information and text input information.
In this embodiment, the electronic device may utilize a trained semantic recognition model to simultaneously recognize the tone and the speaking content in the speech input signal, where the recognition result of the tone is the tone input information and the recognition result of the speaking content is the text input information. Here, the tone input information is used to indicate the tone type of the voice input signal. Tone types may include statement, question, rhetorical question, exclamation, imperative, and the like. Alternatively, the tone input information may be represented by a corresponding tone type label. For example, the tone type labels for statement, question, rhetorical-question, exclamatory and imperative sentences may be predefined as <ind>, <int>, <rhe>, <exc>, <ime>, and the tone input information may be represented by these labels.
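The tone type labels above can be kept in a small lookup structure; a minimal Python sketch follows. The label strings are taken from the text, while the pairing of each label with a sentence type (and the enum name itself) is an inference from the abbreviations rather than something the original specifies.

```python
# Tone type labels as described above; the label-to-type pairing is inferred
# from the abbreviations (<ind> = indicative/statement, <int> = interrogative, ...).
from enum import Enum

class ToneType(Enum):
    STATEMENT = "<ind>"
    QUESTION = "<int>"
    RHETORICAL_QUESTION = "<rhe>"
    EXCLAMATION = "<exc>"
    IMPERATIVE = "<ime>"

def tone_label(tone: ToneType) -> str:
    """Return the tag used to represent the tone input information."""
    return tone.value
```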
The semantic recognition model can be a model which is trained by adopting a machine learning algorithm in advance. Specifically, a machine learning algorithm based on a decision tree, a support vector machine, a neural network, a deep neural network and the like can be adopted, and the semantic recognition model is trained by using training samples. In this embodiment, the input of the semantic recognition model may be a speech signal, and the output may be a mood type corresponding to the speech signal and text content obtained by converting the speech signal.
When a user speaks with different tones, the intonation differs; specifically, the positions of stressed and softened syllables differ. For example, the end of a question sentence is usually spoken with a softened or rising tone, the pitch of a statement sentence is relatively uniform, the stress of a question sentence usually falls at the beginning of the sentence, and so on. In this embodiment, the semantic recognition model may extract pitch features from the speech input signal and recognize the tone input information based on the pitch features.
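To make the pitch-cue idea concrete, the following toy Python heuristic classifies an utterance from its pitch contour alone; it only illustrates the kind of feature the model may rely on, and the thresholds and rules are assumptions rather than the patent's trained semantic recognition model.

```python
# Toy illustration of tone classification from pitch cues; thresholds are
# arbitrary assumptions, and a real model would be trained, not hand-written.
import numpy as np

def classify_tone_from_pitch(pitch_contour: np.ndarray) -> str:
    """pitch_contour: fundamental-frequency estimates (Hz) over the utterance."""
    half = len(pitch_contour) // 2
    head, tail = pitch_contour[:half], pitch_contour[half:]
    if tail.mean() > head.mean() * 1.1:   # pitch rises toward the end of the sentence
        return "<int>"                    # treat as a question
    if pitch_contour.std() < 10.0:        # relatively uniform pitch
        return "<ind>"                    # treat as a statement
    return "<exc>"                        # otherwise treat as an exclamation (toy fallback)
```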
In some optional implementations of the present embodiment, the semantic recognition model is trained based on labeled training samples. Specifically, voice signals containing different tone types may be collected as sample voice signals, and the text content and tone type corresponding to each sample voice signal may be manually labeled; the sample voice signals are then used as the input of the semantic recognition model, the corresponding text contents and tone types are used as its expected output, and the structure and parameters of the semantic recognition model are continuously adjusted and optimized so that the recognition results of the semantic recognition model approach the manually labeled results.
The semantic recognition model trained based on the machine learning method is adopted to recognize the voice input signal, so that the tone recognition independent of tone auxiliary words is realized, the problem that the tone recognition is limited by the set corresponding rule of the tone auxiliary words and tone types is solved, and the application range of the tone recognition is expanded.
And step 203, performing voice service data query based on the tone input information and the text input information, and generating voice response information according to a query result.
In this embodiment, the electronic device may make a voice response according to the tone input information and the text input information identified by the semantic recognition model. In particular, the corresponding response data may be queried in a voice service database. In some alternative implementations, the voice service database may include preset response data templates corresponding to different combinations of tone input information and text input information; a preset response data template may consist of fixed text and labels. For example, the text input information "the weather looks good today" together with a query tone may correspond to the preset response data template "the weather today is <labelA>, air temperature <labelB>". The content to be filled into the preset response data template can then be searched for through the network, and the query result of the voice service data is generated. For example, in the above example, if it is found that the weather is "clear" and the air temperature is "20℃ to 30℃", "clear" may be substituted for the label "<labelA>" and "20℃ to 30℃" for the label "<labelB>", generating the query result "the weather today is clear, and the air temperature is 20℃ to 30℃".
In another scenario, if the text input information identified by the semantic identification model is "the weather is good today" and the tone information is a statement, the corresponding preset response data template, in a statement tone, may be "such good weather is suitable for an outing, and the scenery at the nearby <labelC> is nice". The name of a scenic spot near the user's current location that is suitable for an outing, such as "forest park", is then found through the internet and substituted for the label "<labelC>" in the preset response data template, generating the query result "such good weather is suitable for an outing, and the scenery at the nearby forest park is nice".
The association relationship between preset response data templates and each kind of tone and text input information can be preset, so that after the tone input information and the text input information corresponding to the voice input signal are determined, the corresponding preset response data template can be found according to the preset association relationship; the content to be filled in the preset response data template is then found through network data query, analysis and the like, and the complete voice service data query result is generated.
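A minimal sketch of the template-filling step described in the preceding paragraphs is given below; the template syntax (<labelA>, <labelB>) follows the examples above, while the function name and the way slot values are supplied are assumptions, with the network query for the missing content left as a placeholder.

```python
# Filling a preset response data template, following the weather example above.
def fill_template(template: str, slot_values: dict) -> str:
    """Replace each label in the template with the value found for it."""
    for label, value in slot_values.items():
        template = template.replace(label, value)
    return template

# Slot values would normally come from a network query; fixed here for illustration.
template = "The weather today is <labelA>, air temperature <labelB>"
result = fill_template(template, {"<labelA>": "clear", "<labelB>": "20 to 30 degrees"})
# result == "The weather today is clear, air temperature 20 to 30 degrees"
```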
In other alternative implementations, the voice service data query may be performed as follows. First, emotion information of the user is determined according to the recognized tone input information, where the emotion information may describe the emotional state of the user; for example, a query tone may indicate a calm state, a rhetorical-question tone may indicate displeasure, and an exclamatory tone may indicate excitement. Then, the electronic device may determine, according to the emotion information, an emotion-related condition that the response data needs to satisfy, and use the emotion-related condition as an additional condition when querying response data according to the text input information; if the queried response data satisfies the additional condition, it is used as the query result of the voice service data, otherwise it is not used as the query result of the voice service data.
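The emotion-as-additional-condition idea can be sketched as a simple filter over candidate response data; the tone-to-emotion mapping and the required response attribute below are illustrative assumptions, since the text leaves the concrete condition open.

```python
# Using emotion information derived from the tone as an extra query condition.
TONE_TO_EMOTION = {
    "<int>": "calm",      # query tone -> calm state (per the example above)
    "<rhe>": "unhappy",   # rhetorical-question tone -> displeasure
    "<exc>": "excited",   # exclamatory tone -> excitement
}

# Assumed mapping from the user's emotion to the style the response must have.
EMOTION_TO_REQUIRED_STYLE = {
    "unhappy": "soothing",
    "excited": "enthusiastic",
    "calm": "neutral",
}

def filter_response_candidates(candidates, tone_label):
    """candidates: list of dicts, each with a 'style' field describing the response."""
    emotion = TONE_TO_EMOTION.get(tone_label)
    required_style = EMOTION_TO_REQUIRED_STYLE.get(emotion)
    if required_style is None:
        return candidates  # no emotion-related condition applies
    return [c for c in candidates if c.get("style") == required_style]
```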
The query result of the voice service data is usually data in a text form, and text regularization may be adopted to convert the data in the text form into voice data, so as to generate voice response information. Text regularization may be performed, for example, using a model based on a deep learning framework.
After the voice response information is generated, the voice response information may be output through an audio output interface (speaker) of a terminal device (e.g., the terminal device shown in fig. 1) connected to the electronic device, so that the smart voice service is implemented.
In some optional implementation manners of this embodiment, the step 203 of performing the voice service data query based on the tone input information and the text input information and generating voice response information according to the query result may include: determining user demand information based on the tone input information and the text input information; querying voice service data matched with the user demand information to generate text response information; and converting the text response information into voice response information. That is, the potential needs of the user may be analyzed based on the tone input information and the text input information to obtain the intention information of the user. In particular, the user's intention may be resolved in a variety of ways, for example using machine learning models. In some optional implementations, when analyzing the intention information of the user, intention information having a corresponding relationship with the tone input information and/or keywords in the text input information may be searched for in a preset intention information set. As an example, the preset intention information set may include the intention information "query route", which corresponds to the query tone type together with keywords such as "walking route", "driving route", "bus route" or "how to get there", and may also correspond to the imperative tone type together with the keyword "query", "plan" or "navigate" combined with "route".
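The lookup in the preset intention information set can be sketched as follows; the entries mirror the "query route" example above, while the matching rule (tone type must match, plus keyword conditions) is a simplified assumption.

```python
# Matching tone type plus keywords against a preset intention information set.
PRESET_INTENTIONS = [
    # query tone + any route-related keyword
    {"tone": "<int>",
     "any": ["walking route", "driving route", "bus route", "how to get there"],
     "intent": "query route"},
    # imperative tone + ("query"/"plan"/"navigate") combined with "route"
    {"tone": "<ime>",
     "any": ["query", "plan", "navigate"],
     "all": ["route"],
     "intent": "query route"},
]

def match_intention(tone_label: str, text: str):
    for entry in PRESET_INTENTIONS:
        if entry["tone"] != tone_label:
            continue
        if not any(kw in text for kw in entry.get("any", [])):
            continue
        if not all(kw in text for kw in entry.get("all", [])):
            continue
        return entry["intent"]
    return None  # no preset intention matches
```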
In a further implementation manner, when the voice service data is queried based on the tone input information and the text input information and the voice response information is generated according to the query result, the tone output information corresponding to the tone input information may be queried based on the obtained corresponding relationship between the preset tone input information and the preset tone output information. Here, the mood output information is used to identify the mood of the voice response information to be generated, and when the text response information is converted into the voice response information, text-to-voice conversion can be performed on the text response information in combination with the mood output information to generate the voice response information containing the mood. That is to say, which tone type is adopted as the tone type of the voice response information may be determined according to the corresponding relationship between the different types of preset tone input information and the preset tone output information, so that after the voice service data is queried and the text response information is generated, the tone type is synthesized into the voice response information. The correspondence between the preset tone input information and the preset tone output information may be preset according to experience, for example, when the preset tone input information is a query tone, the corresponding preset tone output information may be a statement tone, and when the preset tone input information is an exclamation tone, the corresponding preset tone output information may be a statement tone or an exclamation tone. Therefore, the mood can be fused in the voice response information, the emotion color of the voice response information is enriched, and the intelligent voice interaction fluency can be promoted.
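The correspondence between preset tone input information and preset tone output information can be held in a small table and consulted before synthesis; in the sketch below the table entries follow the examples just given, and the tts_engine.synthesize call is a hypothetical placeholder for a text-to-speech engine that accepts a tone hint.

```python
# Choosing the output tone from the input tone, then synthesizing with that tone.
PRESET_TONE_OUTPUT = {
    "<int>": "<ind>",  # query tone in -> statement tone out
    "<exc>": "<exc>",  # exclamation in -> statement or exclamation out (one choice shown)
}

def synthesize_with_tone(text_response: str, tone_in: str, tts_engine) -> bytes:
    tone_out = PRESET_TONE_OUTPUT.get(tone_in, "<ind>")  # default to a statement tone
    # tts_engine.synthesize(text, tone=...) is an assumed interface, not a real API.
    return tts_engine.synthesize(text_response, tone=tone_out)
```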
Please refer to fig. 3, which shows a schematic diagram of an application scenario according to an embodiment of the present application. As shown in fig. 3, user a may intelligently interact with smart sound box B after smart sound box B is awakened. When the user a inquires about the weather condition, the smart sound box B can transmit the collected voice signal of the user to the background voice server C. After receiving the voice signal, the voice server C can recognize that the mood of the user is the query mood by using the semantic recognition model, and the text input information is "today's weather looks good". The voice server C can determine that the user wants to know the weather condition of today according to the query tone and the "weather" keyword included in the text input information, then can search the weather forecast of today "clear, 19 to 26 degrees", judge whether the weather forecast is "good", determine that the result is "yes", use the determination result as the answer of the query of the user, and generate a response text "yes, weather is clear, temperature is 19 to 26 degrees" by combining the searched weather forecast of today, and convert the response text into a voice response signal by text regularization, and then transmit the voice response signal back to the smart sound box B. And the smart sound box B can decode and play the voice response signal. In this scenario, although the voice signal uttered by the user does not include the mood assist word or the keyword including the query intention (e.g., "what", etc.), the voice server C may recognize the query intention of the user and then respond.
According to the method for providing the voice service in this embodiment of the application, the voice input signal is obtained, and the tone and the speaking content in the voice input signal are then identified by utilizing the semantic identification model trained by a machine learning method, so that corresponding tone input information and text input information are obtained, wherein the tone input information is used for representing the tone type of the voice input signal; the voice service data is queried based on the tone input information and the text input information, and the voice response information is generated according to the query result. The tone of the user can thus be identified without the help of tone auxiliary words when the voice service is provided, so that the intention of the user is accurately detected, and the response is made in combination with the user intention contained in the user's tone, which improves the degree of matching between the voice service and the user's requirement and realizes a more accurate voice service.
Referring to fig. 4, shown is a flow diagram of another embodiment of a method for providing voice services in accordance with the present application. As shown in fig. 4, a flow 400 of the method for providing a voice service of the present embodiment may include the following steps:
step 401, a voice input signal is obtained.
In this embodiment, an electronic device (for example, a server shown in fig. 1) on which the method for providing a voice service operates may establish a connection with a terminal device (for example, a terminal device shown in fig. 1) having an audio input interface through a network, and the terminal device may acquire voice information uttered by a user through the audio input interface, encode the voice information to generate a voice input signal, and transmit the voice input signal to the electronic device on which the method for providing a voice service operates through the network.
Step 402, a sample dialog set is obtained, the sample dialog set comprising a plurality of segments of sample dialogues, the sample dialogues comprising audio data of a request text and audio data of a corresponding response text.
In this embodiment, the electronic device may obtain a sample dialog set including multiple segments of sample dialogs. Each segment of sample dialog comprises the audio data of the two parties in a conversation: the audio data uttered by the requesting party is the audio data of the request text, and the audio data uttered by the responding party is the audio data of the corresponding response text.
For example, in a segment of sample dialog, user A says "what time is it now" and user B answers "it is 4 pm now"; then "what time is it now" is the request text of the sample dialog and "it is 4 pm now" is the corresponding response text.
Step 403, determining tone information of the corresponding request text according to the audio data of the response text.
In this embodiment, the tone information of the request text can be analyzed according to the response text. Specifically, the multiple tones that the request text may carry can first be determined as multiple pieces of candidate tone information; for example, "what time is it now" in the above example may be spoken in a query tone or an exclamatory tone, so "query" and "exclamation" can be taken as the two pieces of candidate tone information. Then, decoding and semantic analysis can be carried out on the audio data of the response text, or tone information can be extracted from the audio data of the response text, and it is judged, according to the semantic analysis result or the extracted tone information, which tone type of request the response text is responding to, so that the tone information of the request text can be determined from the candidate tone information according to the response text.
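A toy version of the candidate-tone selection just described might look like the following; the cue words and the fallback rule are assumptions, since the text leaves the concrete semantic analysis of the response open.

```python
# Choosing the request's tone from candidate tones by inspecting the response text.
def infer_request_tone(candidate_tones, response_text: str) -> str:
    """candidate_tones: e.g. ["<int>", "<exc>"] for "what time is it now"."""
    answer_cues = ("yes", "no", "it is", "now")  # cues that the response answers a question
    if "<int>" in candidate_tones and response_text.lower().startswith(answer_cues):
        return "<int>"  # the response reads like an answer, so the request was a question
    return candidate_tones[0]  # otherwise fall back to the first candidate
```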
And step 404, taking the audio data of the request text, the request text and the mood information of the request text as training samples, and training the semantic recognition model by adopting a machine learning method.
And then, a semantic recognition model can be constructed, the audio data of the request text is recognized to obtain the request text, then the audio data of the request text, the corresponding request text and the mood information of the request text are used as training samples, and the semantic recognition model is input for training. Specifically, the audio data of the request text can be used as input, then the error between the output of the semantic recognition model and the corresponding tone information of the request text and the request text is calculated, and then the model parameters are adjusted to make the error converge. The semantic recognition model may be a machine learning model, and may include, but is not limited to: a logistic regression model, a hidden Markov model, a convolutional neural network model, and a recurrent neural network model. After the training is completed, a semantic recognition model which is trained by adopting a machine learning method can be obtained.
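As one concrete instance of the training step, the sketch below trains only the tone-classification half of the semantic recognition model with a logistic regression classifier (one of the model families named above); the feature extraction is a crude placeholder and the text-recognition half is omitted, so this is an illustration under stated assumptions rather than the disclosed training procedure.

```python
# Training a tone classifier from (request audio, tone label) pairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(audio_samples):
    """Crude per-utterance features (mean, spread, end-to-start change of the signal)."""
    return np.array([[np.mean(a), np.std(a), float(a[-1] - a[0])] for a in audio_samples])

def train_tone_classifier(request_audio, tone_labels):
    """request_audio: list of 1-D numpy arrays; tone_labels: e.g. ["<int>", "<ind>", ...]."""
    X = extract_features(request_audio)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, tone_labels)  # fit so predictions approach the labels from the sample dialogs
    return clf
```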
In some optional implementations of this embodiment, before step 402, the method for providing a voice service may further include a step of constructing the sample dialog set. Specifically, dialog corpora containing audio data of preset request texts may be collected, for example, dialogs containing the preset request texts in movies or television episodes; then the audio data of the response text corresponding to each preset request text is extracted from each dialog corpus, that is, the audio data of the response text used for responding to the preset request text in the dialog is extracted; finally, the audio data of each preset request text and the audio data of the corresponding response text are combined to generate multiple segments of sample dialogs to form the sample dialog set, that is, the audio data of a preset request text and the extracted audio data of the response text used for responding to it are combined from the dialog corpora into one segment of sample dialog. Therefore, after multiple dialog corpora are collected and the audio data of the preset request texts and the audio data of the corresponding response texts are extracted, multiple segments of sample dialogs can be generated, thereby generating the sample dialog set. Alternatively, the preset request texts may be request texts that can be expressed in different tone types, such as "really" (which may be spoken in a query tone or a statement tone) and "the men's 100 m world record has been broken" (which may be spoken in an exclamatory tone, a query tone, or a statement tone).
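The construction of the sample dialog set can be sketched with a simple pairing pass over the collected dialog corpora; the data layout (a corpus as a list of (text, audio) turns) and the rule that the turn following a preset request is its response are assumptions made for illustration.

```python
# Assembling sample dialogs (request audio, response audio) from dialog corpora.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SampleDialog:
    request_audio: bytes   # audio data of the preset request text
    response_audio: bytes  # audio data of the corresponding response text

def build_sample_dialog_set(corpora: List[List[Tuple[str, bytes]]],
                            preset_requests: set) -> List[SampleDialog]:
    """corpora: each corpus is a list of (utterance text, audio) turns in order."""
    dialogs = []
    for turns in corpora:
        for i, (text, audio) in enumerate(turns[:-1]):
            if text in preset_requests:
                _, response_audio = turns[i + 1]  # assume the next turn answers the request
                dialogs.append(SampleDialog(audio, response_audio))
    return dialogs
```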
Step 405, recognizing the tone and the speaking content in the voice input signal by using the semantic recognition model trained by the machine learning method to obtain corresponding tone input information and text input information, wherein the tone input information is used for representing the tone type of the voice input signal.
In this embodiment, the semantic recognition model trained in step 404 may be used to simultaneously recognize the tone and the speaking content of the speech input signal, so as to obtain tone input information and text input information. The tone input information can be represented by a tone type label, and the text input information is a text corresponding to the voice input signal.
And 406, performing voice service data query based on the tone input information and the text input information, and generating voice response information according to a query result.
In this embodiment, the voice service database may be queried for corresponding response data according to the tone input information and the text input information. For example, a preset response data template corresponding to the mood input information and the text input information may be searched in the voice service database, and then the data to be supplemented in the preset response data template may be obtained from the network, so as to generate the response data. For example, emotion information of the user may be determined from the recognized speech input information, and an additional condition for voice service data query may be determined from the emotion information. Then, text regularization processing can be performed on the query result, and the queried text information is converted into voice response information.
Steps 401, 405, and 406 in the above method flow are respectively the same as steps 201, 202, and 203 in the foregoing embodiment, and the above description for steps 201, 202, and 203 also applies to steps 401, 405, and 406 in this embodiment, and are not repeated here.
As can be seen from fig. 4, compared with the embodiment shown in fig. 2, the method of this embodiment adds the steps of obtaining a sample dialog set, determining the tone information of the corresponding request text according to the audio data of the response text in the sample dialog, and training the semantic recognition model by a machine learning method with the audio data of the request text, the request text, and the tone information of the request text in the sample dialog as training samples. The method for providing a voice service of this embodiment therefore provides a training method for the semantic recognition model, so that the semantic recognition model can better learn the inherent logic of interaction in a real dialog scene, improving the reliability and accuracy of the semantic recognition model.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for providing a voice service, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 5, the apparatus 500 for providing a voice service of the present embodiment may include: an acquisition unit 501, a recognition unit 502 and a response unit 503. The obtaining unit 501 is configured to obtain a voice input signal; the recognition unit 502 is configured to recognize a mood and a speaking content in a speech input signal by using a semantic recognition model trained by a machine learning method, so as to obtain corresponding mood input information and text input information, where the mood input information is used to indicate a mood type of the speech input signal; the response unit 503 is configured to perform voice service data query based on the mood input information and the text input information, and generate voice response information according to a query result.
In this embodiment, the obtaining unit 501 may establish a connection with a terminal device having an audio input interface (for example, the terminal device shown in fig. 1) through a network, and receive a voice input signal of a user obtained and encoded through the audio input interface from the terminal device.
The recognition unit 502 may recognize the voice input signal acquired by the acquisition unit 501 by using a semantic recognition model, and obtain the tone input information and the text input information of the voice input signal. The tone input information can be represented by a tone type label, and the text input information is a text corresponding to the voice input signal. The semantic recognition model can be obtained by training by adopting a machine learning algorithm such as a regression model, a deep neural network and the like, and can simultaneously recognize the tone and the speaking content in the utterance.
The response unit 503 may respond according to the text input information and the mood input information recognized by the recognition unit 502. Specifically, data that matches or can satisfy the user potential requirement included in the mood input information and can satisfy the user requirement included in the text input information can be searched in the voice database to generate text response information, and then text regularization can be adopted to convert the text response information into voice response information.
In some embodiments, the response unit 503 may be further configured to generate the voice response information as follows: determining user demand information based on the mood input information and the text input information; inquiring voice service data matched with the user demand information to generate text response information; and converting the text response information into voice response information.
In a further embodiment, the response unit 503 may further be configured to: and inquiring tone output information corresponding to the tone input information based on the acquired corresponding relation between the preset tone input information and the preset tone output information, wherein the tone output information is used for identifying the tone of the voice response information to be generated. At this time, the response unit 503 may be further configured to convert the text response information into the voice response information as follows: and performing text-to-speech conversion on the text response information by combining the tone output information to generate speech response information containing tones.
In some embodiments, the apparatus 500 may further include: the system comprises a sample acquisition unit, a data processing unit and a data processing unit, wherein the sample acquisition unit is used for acquiring a sample conversation set, the sample conversation set comprises a plurality of sections of sample conversations, and the sample conversations comprise audio data of a request text and audio data of a corresponding response text; the determining unit is used for determining tone information of the corresponding request text according to the audio data of the response text; and the training unit is used for taking the audio data of the request text, the request text and the mood information of the request text as training samples and training the semantic recognition model by adopting a machine learning method.
In some embodiments, the apparatus 500 may further include a construction unit for constructing the sample dialog set. The construction unit may construct the sample dialog set as follows: collecting dialogue linguistic data containing audio data of a preset request text; extracting the audio data of the response text corresponding to each preset request text from each dialogue corpus; and combining the audio data of each preset request text and the audio data of the corresponding response text to generate a plurality of sections of sample conversations so as to form a sample conversation set.
The device 500 for providing voice service according to the embodiment of the application obtains a voice input signal through an obtaining unit, and a recognition unit recognizes the tone and the speaking content in the voice input signal by using a semantic recognition model trained by a machine learning method to obtain corresponding tone input information and text input information, wherein the tone input information is used for representing the tone type of the voice input signal; the response unit carries out voice service data query based on the tone input information and the text input information, generates voice response information according to a query result, can accurately detect the intention of a user through tone recognition when voice service is provided, and then responds by combining the intention of the user contained in the tone of the user, so that the matching degree of the voice service and the user requirement is improved, and more accurate voice service is realized.
It should be understood that the units recited in the apparatus 500 may correspond to various steps in the methods described with reference to fig. 2 and 4. Thus, the operations and features described above for the method are equally applicable to the apparatus 500 and the units included therein, and are not described in detail here.
Referring now to FIG. 6, shown is a block diagram of a computer system 600 suitable for use in implementing a server according to embodiments of the present application. The terminal device or the server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the system 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 610 as necessary, so that a computer program read out therefrom is mounted in the storage section 608 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 601. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, a recognition unit, and a response unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the acquisition unit may also be described as a "unit that acquires a voice input signal".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring a voice input signal; recognizing the tone and the speaking content in the voice input signal by utilizing a semantic recognition model trained by adopting a machine learning method to obtain corresponding tone input information and text input information, wherein the tone input information is used for expressing the tone type of the voice input signal; and performing voice service data query based on the tone input information and the text input information, and generating voice response information according to a query result.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method for providing voice services, the method comprising:
obtaining a sample dialog set, wherein the sample dialog set comprises a plurality of sections of sample dialogues, and the sample dialogues comprise audio data of a request text and audio data of a corresponding response text;
determining tone information of the corresponding request text according to the audio data of the response text, wherein the tone information comprises: determining a plurality of candidate tone information according to a plurality of tones contained in the request text; determining the tone information of the request text from the candidate tone information according to the response text;
taking the audio data of the request text, the request text and the mood information of the request text as training samples, and training the semantic recognition model by adopting a machine learning method;
acquiring a voice input signal;
recognizing the tone and the speaking content in the voice input signal by using the semantic recognition model trained by the machine learning method, to obtain corresponding tone input information and text input information, wherein the tone input information is used for expressing the tone type of the voice input signal; and
performing a voice service data query based on the tone input information and the text input information, and generating voice response information according to a query result.
2. The method of claim 1, wherein the performing a voice service data query based on the tone input information and the text input information, and generating voice response information according to a query result comprises:
determining user demand information based on the tone input information and the text input information;
querying voice service data matched with the user demand information to generate text response information; and
converting the text response information into the voice response information.
3. The method of claim 2, wherein the performing a voice service data query based on the tone input information and the text input information, and generating voice response information according to a query result further comprises:
querying tone output information corresponding to the tone input information based on an acquired correspondence between preset tone input information and preset tone output information, wherein the tone output information is used for identifying the tone of the voice response information to be generated; and
the converting the text response information into the voice response information comprises:
performing text-to-speech conversion on the text response information in combination with the tone output information to generate voice response information containing the tone.
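A minimal sketch of the tone lookup and tone-aware text-to-speech conversion described in claim 3 is given below, assuming a hard-coded correspondence table and an SSML-like markup string; the table entries and function names are illustrative only.

```python
# Sketch of tone-aware text-to-speech; the correspondence table and markup are assumed for illustration.
PRESET_TONE_CORRESPONDENCE = {
    # preset tone input information -> preset tone output information
    "interrogative": "reassuring",
    "imperative": "brisk",
    "exclamatory": "enthusiastic",
}


def query_tone_output_information(tone_input_information: str) -> str:
    """Query the tone output information corresponding to the tone input information."""
    return PRESET_TONE_CORRESPONDENCE.get(tone_input_information, "neutral")


def text_to_speech_with_tone(text_response_information: str, tone_output_information: str) -> str:
    """Combine the tone output information with the text response information before synthesis.

    An SSML-like string is returned here; a real system would hand it to a speech synthesizer.
    """
    return (f'<speak><prosody style="{tone_output_information}">'
            f'{text_response_information}</prosody></speak>')
```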
4. The method of claim 1, further comprising a step of constructing the sample dialogue set, the step comprising:
collecting dialogue corpora containing audio data of preset request texts;
extracting, from each dialogue corpus, audio data of a response text corresponding to each preset request text; and
combining the audio data of each preset request text and the audio data of the corresponding response text to generate a plurality of sample dialogues so as to form the sample dialogue set.
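The sample dialogue construction of claim 4 and the tone determination and training-sample assembly of claim 1 could be organized roughly as sketched below; the data class, the assumed candidate-tone list, and the toy selection heuristic are illustrative, and the actual training of the semantic recognition model is left out.

```python
# Illustrative sketch of sample dialogue handling; all names and the heuristic are assumed.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SampleDialogue:
    request_text: str
    request_audio: bytes
    response_text: str
    response_audio: bytes


def determine_tone_information(dialogue: SampleDialogue, candidate_tones: List[str]) -> str:
    """Pick the tone information of the request text from candidate tones, using the response text.

    A toy heuristic stands in for the determination step of claim 1.
    """
    if "?" in dialogue.response_text and "interrogative" in candidate_tones:
        return "interrogative"
    return candidate_tones[0] if candidate_tones else "neutral"


def build_training_samples(dialogues: List[SampleDialogue]) -> List[Tuple[bytes, str, str]]:
    """Produce (request audio, request text, tone information) training samples."""
    samples = []
    for d in dialogues:
        candidates = ["declarative", "interrogative", "imperative"]  # assumed candidate tones
        tone = determine_tone_information(d, candidates)
        samples.append((d.request_audio, d.request_text, tone))
    return samples
```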
5. An apparatus for providing voice services, the apparatus comprising:
a sample acquiring unit, configured to acquire a sample dialogue set, wherein the sample dialogue set comprises a plurality of sample dialogues, and each sample dialogue comprises audio data of a request text and audio data of a corresponding response text;
a determining unit, configured to determine, according to the audio data of the response text, tone information of the corresponding request text, and further configured to: determine a plurality of pieces of candidate tone information according to a plurality of tones contained in the request text; and determine the tone information of the request text from the candidate tone information according to the response text;
a training unit, configured to take the audio data of the request text, the request text, and the tone information of the request text as training samples, and train a semantic recognition model by using a machine learning method;
an acquisition unit, configured to acquire a voice input signal;
a recognition unit, configured to recognize the tone and the speaking content in the voice input signal by using the semantic recognition model trained by the machine learning method, to obtain corresponding tone input information and text input information, wherein the tone input information is used for expressing the tone type of the voice input signal; and
a response unit, configured to perform a voice service data query based on the tone input information and the text input information, and generate voice response information according to a query result.
6. The apparatus of claim 5, wherein the response unit is further configured to generate the voice response information as follows:
determining user demand information based on the tone input information and the text input information;
querying voice service data matched with the user demand information to generate text response information; and
converting the text response information into the voice response information.
7. The apparatus of claim 6, wherein the response unit is further configured to:
query tone output information corresponding to the tone input information based on an acquired correspondence between preset tone input information and preset tone output information, wherein the tone output information is used for identifying the tone of the voice response information to be generated; and
the response unit is further configured to convert the text response information into the voice response information as follows:
performing text-to-speech conversion on the text response information in combination with the tone output information to generate voice response information containing the tone.
8. The apparatus of claim 5, further comprising a construction unit configured to construct the sample dialogue set as follows:
collecting dialogue corpora containing audio data of preset request texts;
extracting, from each dialogue corpus, audio data of a response text corresponding to each preset request text; and
combining the audio data of each preset request text and the audio data of the corresponding response text to generate a plurality of sample dialogues so as to form the sample dialogue set.
9. A server, comprising:
one or more processors;
a storage device, configured to store one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-4.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-4.
CN201710882420.0A 2017-09-26 2017-09-26 Method and apparatus for providing voice service Active CN107657017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710882420.0A CN107657017B (en) 2017-09-26 2017-09-26 Method and apparatus for providing voice service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710882420.0A CN107657017B (en) 2017-09-26 2017-09-26 Method and apparatus for providing voice service

Publications (2)

Publication Number Publication Date
CN107657017A CN107657017A (en) 2018-02-02
CN107657017B true CN107657017B (en) 2020-11-13

Family

ID=61131289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710882420.0A Active CN107657017B (en) 2017-09-26 2017-09-26 Method and apparatus for providing voice service

Country Status (1)

Country Link
CN (1) CN107657017B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563633B (en) * 2018-03-29 2021-05-14 腾讯科技(深圳)有限公司 Voice processing method and server
EP3576084B1 (en) * 2018-05-29 2020-09-30 Christoph Neumann Efficient dialog design
CN108882111A (en) * 2018-06-01 2018-11-23 四川斐讯信息技术有限公司 A kind of exchange method and system based on intelligent sound box
CN108877795B (en) * 2018-06-08 2020-03-10 百度在线网络技术(北京)有限公司 Method and apparatus for presenting information
CN110634486A (en) * 2018-06-21 2019-12-31 阿里巴巴集团控股有限公司 Voice processing method and device
CN110719544A (en) * 2018-07-11 2020-01-21 惠州迪芬尼声学科技股份有限公司 Method for providing VUI specific response and application thereof in intelligent sound box
CN108962228B (en) * 2018-07-16 2022-03-15 北京百度网讯科技有限公司 Model training method and device
CN109190107A (en) * 2018-07-17 2019-01-11 湖南优浪语音科技有限公司 Intelligent dialogue method and apparatus
WO2020060151A1 (en) 2018-09-19 2020-03-26 Samsung Electronics Co., Ltd. System and method for providing voice assistant service
CN110019748B (en) * 2018-09-27 2021-12-24 联想(北京)有限公司 Data processing method and electronic equipment
CN109346076A (en) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Interactive voice, method of speech processing, device and system
KR20200050373A (en) * 2018-11-01 2020-05-11 삼성전자주식회사 Electronic apparatus and control method thereof
CN109671435B (en) * 2019-02-21 2020-12-25 三星电子(中国)研发中心 Method and apparatus for waking up smart device
CN110033659B (en) * 2019-04-26 2022-01-21 北京大米科技有限公司 Remote teaching interaction method, server, terminal and system
CN112242135A (en) * 2019-07-18 2021-01-19 北京声智科技有限公司 Voice data processing method and intelligent customer service device
CN110489458A (en) * 2019-07-31 2019-11-22 广州竞德信息技术有限公司 Analysis method based on semantics recognition
CN110442701B (en) * 2019-08-15 2022-08-05 思必驰科技股份有限公司 Voice conversation processing method and device
CN110830661A (en) * 2019-11-11 2020-02-21 科大国创软件股份有限公司 Automatic dial testing method for intelligent voice customer service
CN111031386B (en) * 2019-12-17 2021-07-30 腾讯科技(深圳)有限公司 Video dubbing method and device based on voice synthesis, computer equipment and medium
CN111104506B (en) * 2019-12-30 2024-02-20 深圳追一科技有限公司 Method and device for determining reply result of man-machine interaction and electronic equipment
CN111324713B (en) * 2020-02-18 2022-03-04 腾讯科技(深圳)有限公司 Automatic replying method and device for conversation, storage medium and computer equipment
CN112071304B (en) * 2020-09-08 2024-03-15 深圳市天维大数据技术有限公司 Semantic analysis method and device
CN112908314B (en) * 2021-01-29 2023-01-10 深圳通联金融网络科技服务有限公司 Intelligent voice interaction method and device based on tone recognition
CN113254613A (en) * 2021-05-24 2021-08-13 深圳壹账通智能科技有限公司 Dialogue question-answering method, device, equipment and storage medium
TWI776589B (en) * 2021-07-13 2022-09-01 國立臺灣師範大學 Emotional Reply System
CN114462364B (en) * 2022-02-07 2023-01-31 北京百度网讯科技有限公司 Method and device for inputting information
CN115766947B (en) * 2023-01-09 2023-05-09 广东电网有限责任公司 Intelligent management and control method and system for power grid customer service center

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654950A (en) * 2016-01-28 2016-06-08 百度在线网络技术(北京)有限公司 Self-adaptive voice feedback method and device
CN105929964A (en) * 2016-05-10 2016-09-07 海信集团有限公司 Method and device for human-computer interaction
CN106683672A (en) * 2016-12-21 2017-05-17 竹间智能科技(上海)有限公司 Intelligent dialogue method and system based on emotion and semantics
CN107003997A (en) * 2014-12-04 2017-08-01 微软技术许可有限责任公司 Type of emotion for dialog interaction system is classified

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503805B (en) * 2016-11-14 2019-01-29 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107003997A (en) * 2014-12-04 2017-08-01 微软技术许可有限责任公司 Type of emotion for dialog interaction system is classified
CN105654950A (en) * 2016-01-28 2016-06-08 百度在线网络技术(北京)有限公司 Self-adaptive voice feedback method and device
CN105929964A (en) * 2016-05-10 2016-09-07 海信集团有限公司 Method and device for human-computer interaction
CN106683672A (en) * 2016-12-21 2017-05-17 竹间智能科技(上海)有限公司 Intelligent dialogue method and system based on emotion and semantics

Also Published As

Publication number Publication date
CN107657017A (en) 2018-02-02

Similar Documents

Publication Publication Date Title
CN107657017B (en) Method and apparatus for providing voice service
CN107945786B (en) Speech synthesis method and device
CN107590135B (en) Automatic translation method, device and system
CN107767869B (en) Method and apparatus for providing voice service
EP3282368A1 (en) Parallel processing-based translation method and apparatus
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN108986790A (en) The method and apparatus of voice recognition of contact
CN109920431B (en) Method and apparatus for outputting information
CN110880198A (en) Animation generation method and device
CN108877803B (en) Method and apparatus for presenting information
CN113658577B (en) Speech synthesis model training method, audio generation method, equipment and medium
CN112765971B (en) Text-to-speech conversion method and device, electronic equipment and storage medium
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
CN112509562A (en) Method, apparatus, electronic device and medium for text post-processing
CN114330371A (en) Session intention identification method and device based on prompt learning and electronic equipment
CN108804667B (en) Method and apparatus for presenting information
CN110930975A (en) Method and apparatus for outputting information
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN114637843A (en) Data processing method and device, electronic equipment and storage medium
CN114220461A (en) Customer service call guiding method, device, equipment and storage medium
CN112309367A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant