CN115641875A - Voice information processing method, apparatus, device, storage medium and program product

Publication number: CN115641875A
Application number: CN202211276151.0A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Inventors: 吉培轩, 郭启行, 王福到
Applicant/Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Prior art keywords: question, target, determining, target object, voice
Classification: Telephonic Communication Services (AREA)

Abstract

The present disclosure provides a voice information processing method, apparatus, device, storage medium and program product, belonging to the technical fields of artificial intelligence and computers, and in particular to the technical fields of speech processing and deep learning. The specific implementation scheme of the voice information processing method is as follows: determining a target audio feature associated with an object identifier according to the object identifier associated with received voice information, wherein the target audio feature characterizes an audio feature of the target object indicated by the object identifier; and in a manual question-answering mode, detecting whether the voice information matches the target audio feature to obtain a target object detection result.

Description

Voice information processing method, apparatus, device, storage medium and program product
Technical Field
The present disclosure relates to the field of artificial intelligence and computer technologies, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for processing speech information.
Background
Voice processing is an important branch of artificial intelligence and can be applied in a variety of scenarios. However, in some application scenarios the efficiency of voice information processing is low and cannot meet the actual requirements of those scenarios.
Disclosure of Invention
The present disclosure provides a voice information processing method, apparatus, device, storage medium, and program product.
According to an aspect of the present disclosure, there is provided a voice information processing method including: determining a target audio feature associated with an object identifier according to the object identifier associated with received voice information, wherein the target audio feature characterizes an audio feature of the target object indicated by the object identifier; and in a manual question-answering mode, detecting whether the voice information matches the target audio feature to obtain a target object detection result.
According to another aspect of the present disclosure, there is provided a voice information processing apparatus including a target audio feature determination module and a target object detection result determination module. The target audio feature determination module is configured to determine, according to the object identifier associated with received voice information, the target audio feature associated with the object identifier, wherein the target audio feature characterizes an audio feature of the target object indicated by the object identifier. The target object detection result determination module is configured to detect, in a manual question-answering mode, whether the voice information matches the target audio feature to obtain a target object detection result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor and a memory communicatively coupled to the at least one processor. Wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the embodiments of the disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, the computer program, when executed by a processor, implementing the method of the embodiments of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates a system architecture diagram of a voice information processing method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flow chart of a method of speech information processing according to an embodiment of the present disclosure;
fig. 3 schematically shows a schematic diagram of a speech information processing method according to another embodiment of the present disclosure;
fig. 4 schematically shows a schematic diagram of obtaining a target object detection result of a voice information processing method according to yet another embodiment of the present disclosure;
fig. 5 schematically shows a schematic diagram of a voice information processing method according to yet another embodiment of the present disclosure;
fig. 6 schematically shows a schematic diagram of a voice information processing method according to yet another embodiment of the present disclosure;
FIG. 7 schematically shows a block diagram of a speech information processing apparatus according to an embodiment of the present disclosure; and
fig. 8 schematically shows a block diagram of an electronic device that can implement the voice information processing method of the embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Speech processing is an important branch of artificial intelligence and can be applied in a variety of scenarios; the voice information processing method of the embodiments of the present disclosure is described below taking a call center as an example. A call center can be understood as a service organization consisting of a group of customer service personnel (i.e., customer service) at a relatively centralized location, typically using computer communication technology to handle telephone inquiries from businesses, customers and the like. A call center has the capability of handling a large number of incoming calls simultaneously and displaying calling numbers, can automatically distribute incoming calls to customer service personnel with the corresponding skills, and can record and store all incoming call information.
In a call center scenario, a plurality of customer service agents (seats) may be set up, each agent is provided with a call terminal, and each customer service person is matched with the call terminal of at least one agent in order to answer incoming calls.
In an actual call center scenario, the following situations exist.
1) With the development of computer and internet technology, robot customer service can replace manual customer service for part of the call interaction. However, robot customer service is usually configured with a fixed timbre and its speech sounds mechanical, which reduces the smoothness and efficiency of the call. A service object talking to customer service may therefore be averse to robot customer service.
2) For example, customer service agent A is matched with customer service person a, and customer service agent B is matched with customer service person b. For some reason, an incoming call on the call terminal of agent A is answered by person b, causing answering confusion at the agents. Such confusion not only affects the management of the call center, but also affects the call experience of the service object.
It should be noted that in the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information, such as call voice, all comply with the relevant laws and regulations and do not violate public order and good customs.
Fig. 1 schematically shows a system architecture of a voice information processing method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, a system architecture 100 according to an embodiment of the present disclosure may include telephony terminals 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the call terminals 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The call terminals 101, 102, 103 may be electronic devices having a call function, and the call terminals 101, 102, 103 may include, for example, fixed telephones. The target object can use the call terminals 101, 102, 103 to make a call, and the obtained voice information can be sent to the server 105 for processing, for example. For example, in a call center scenario, the target object may be a customer service person, and the customer service person may use a call terminal to communicate with the service object.
The server 105 may be a server that provides various services, for example, processing voice information of a call made by a target object using the call terminals 101, 102, 103. In addition, the server 105 may also be a cloud server, i.e., the server 105 has a cloud computing function.
Illustratively, the system architecture 100 according to embodiments of the present disclosure may also include clients 106, for example.
The client 106 may be any of a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop computers, desktop computers, and the like. The client 106 of the disclosed embodiments may, for example, run an application. The target object detection result, key content and the like processed by the server 105 may, for example, be sent to and displayed on the client 106.
It should be noted that the voice information processing method provided by the embodiment of the present disclosure may be executed by the server 105. Accordingly, the voice information processing apparatus provided by the embodiment of the present disclosure may be disposed in the server 105. The voice information processing method provided by the embodiment of the present disclosure may also be executed by a server or a server cluster that is different from the server 105 and is capable of communicating with the call terminals 101, 102, 103 and/or the server 105. Accordingly, the voice information processing apparatus provided in the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the call terminals 101, 102, 103 and/or the server 105.
It should be understood that the number of telephony terminals, networks, servers, and clients in fig. 1 is merely illustrative. There may be any number of telephony terminals, networks, servers and clients, as desired for an implementation.
The following describes a voice information processing method according to an exemplary embodiment of the present disclosure with reference to fig. 2 to 6 in conjunction with the system architecture of fig. 1. The voice information processing method of the embodiment of the present disclosure may be performed by the server 105 shown in fig. 1, for example.
Fig. 2 schematically shows a flow chart of a voice information processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the voice information processing method 200 of the embodiment of the present disclosure may include, for example, operations S210 to S220.
In operation S210, a target audio feature associated with an object identifier is determined according to the object identifier associated with the received voice information.
Illustratively, a call terminal may send the voice information of a call; the server receives the voice information, can determine the corresponding call terminal from it, and can then determine the object identifier matched with that call terminal.
It can be understood that a voice call using a call terminal involves two parties, one of which is the target object; the voice information processing method of the embodiments of the present disclosure is directed to the voice information of the target object. In a call center scenario, the "target object", as one of the two parties to the voice call, is the customer service person, and the other party is the service object. The following description takes as an example the application of the voice information processing method of the embodiments of the present disclosure to a call center scenario.
The object identifier may be, for example, the agent identifier matched with the call terminal, or an identifier of the customer service person. In the case where the object identifier is the agent identifier matched with the call terminal, the corresponding customer service person can be matched according to the agent identifier.
Illustratively, the object identifier matched with the call terminal can be determined according to the call terminal sending the voice information.
The target audio feature characterizes an audio feature of the target object indicated by the object identification.
Illustratively, the target audio feature may be predetermined. For example, the target audio feature corresponding to each target object may be registered in advance to obtain a target audio feature set. Upon receiving the voice information, the target audio feature associated with the object identifier may then be determined from the target audio feature set based on the object identifier associated with the voice information.
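As a minimal sketch, the pre-registration and lookup described above can be modeled as a keyed store of enrolled audio features (e.g. speaker embeddings). The store layout and function names below are illustrative assumptions rather than the patent's concrete implementation, in Python:

    import numpy as np

    # In-memory target audio feature set: object identifier -> enrolled feature.
    TARGET_FEATURE_SET: dict[str, np.ndarray] = {}

    def register_target_feature(object_id: str, feature: np.ndarray) -> None:
        # Enrollment: register the target object's audio feature in advance.
        TARGET_FEATURE_SET[object_id] = feature

    def lookup_target_feature(object_id: str) -> np.ndarray:
        # Operation S210: determine the target audio feature associated with
        # the object identifier of the received voice information.
        return TARGET_FEATURE_SET[object_id]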
In operation S220, in the manual question-answering mode, whether the voice information matches the target audio feature is detected to obtain a target object detection result.
The manual question-and-answer mode can be understood as a mode in which a call interaction is performed manually.
The object identifier is associated with the target object it indicates. In some cases, however, the voice information gets mixed up with the object identifier, which may cause a mismatch between the voice information and the target object.
Each target object has its own specific target audio feature, which can be used as the basis for detecting whether the voice information matches the target object.
According to the voice information processing method of the embodiments of the present disclosure, whether incoming calls are being answered in a disordered manner can be detected automatically from the target object detection result obtained by checking whether the voice information matches the target audio feature. Since each target object has its specific target audio feature, an accurate target object detection result can be obtained by detecting whether the voice information matches the target audio feature.
Illustratively, the target object detection result may also be transmitted. For example, it may be sent to the relevant personnel, so that they can learn from the fed-back result whether the voice information matches the target audio feature, and thereby determine whether the voice information and the target object have been mixed up. In the case that the target object detection result indicates that the object identifier associated with the voice information does not match the target audio feature, an intervention may also be performed.
Fig. 3 schematically shows a schematic diagram of a voice information processing method according to another embodiment of the present disclosure.
As shown in fig. 3, the voice information processing method 300 according to an embodiment of the present disclosure may further include operation S330.
In operation S330, in a case that the target object detection result indicates that the voice information does not match the target audio feature, a first target interactive voice for output is determined according to the voice information and the target audio feature.
In the case that the target object detection result indicates that the voice information does not match the target audio feature, the voice information and the target object have been mixed up. Taking the call center application scenario as an example, such a result indicates that the customer service personnel have been swapped, for example customer service person a has answered an incoming call intended for customer service person b.
Illustratively, the voice information may be processed according to the target audio feature, for example using a speech synthesis model, so that the content of the first target interactive voice is consistent with the voice information while the first target interactive voice carries the target audio feature. It will be appreciated that each target object has its own specific audio features, which may embody, for example, the pitch characteristics, frequency characteristics and other deeper characteristics of the target object. The first target interactive voice obtained by processing the voice information according to the target audio feature can thus carry the target audio feature and restore the audio characteristics of the target object matched with the object identifier.
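The disclosure does not name a concrete synthesis model, so the sketch below only fixes the interface such a step would need; VoiceConversionModel is a hypothetical stand-in for whatever speech synthesis or voice conversion model is used:

    import numpy as np

    class VoiceConversionModel:
        # Hypothetical stand-in for a speech synthesis / voice conversion model.
        def convert(self, waveform: np.ndarray, target_feature: np.ndarray) -> np.ndarray:
            raise NotImplementedError("backed by a real model in practice")

    def first_target_interactive_voice(model: VoiceConversionModel,
                                       voice_info: np.ndarray,
                                       target_feature: np.ndarray) -> np.ndarray:
        # The linguistic content stays consistent with the received voice
        # information; only the audio characteristics (timbre, pitch, ...)
        # are mapped onto the target object's.
        return model.convert(voice_info, target_feature)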
According to the voice information processing method of this embodiment, a situation in which the voice information and the target object have been mixed up can be intervened in through the first target interactive voice determined for output from the voice information and the target audio feature: the output first target interactive voice carries the target audio feature and restores the audio characteristics of the target object that the voice information should match. Hearing the first target interactive voice improves the call experience of the service object.
In the example of fig. 3, operations S310 to S320 are also schematically shown. Operations S310 and S320 are similar to operations S210 and S220, respectively, and are not described herein again.
In the example of fig. 3, a specific example is schematically shown in which the service object 301 and the target object 303-S1 communicate in the manual question-answering mode through the call terminal 302-S2 to obtain the voice information 304. In the example of fig. 3, the object identifier 305 of the target object matched with the call terminal 302-S2 is S2; the voice information 304 sent by the call terminal 302-S2 is associated with the object identifier 305, and the target audio feature F-S2 associated with it can be determined from the object identifier 305. The voice information 304 is obtained from the target object 303-S1 talking with the service object 301, so the voice information 304 represents the voice of target object S1. By comparing the voice information 304 with the target audio feature F-S2, it can be determined that the target object S1 corresponding to the voice information 304 does not match the target object S2 corresponding to the target audio feature F-S2, so the target object detection result 306 indicates that the voice information 304 does not match the target audio feature F-S2. At this time, the first target interactive voice 307 for output may be determined from the voice information 304 and the target audio feature F-S2. The content of the first target interactive voice 307 may, for example, be consistent with the voice information 304, while the first target interactive voice 307 carries the target audio feature F-S2.
In the example of fig. 3, a specific example is also schematically shown in which, for example in a call center scenario, there are N target objects 303 with N object identifiers S1 to SN. Fig. 3 also schematically shows that each target object is associated with its own target audio feature through its object identifier, e.g. object identifier S1 is associated with target audio feature F-S1.
The voice information processing method according to still another embodiment of the present disclosure may further include: and determining an audio feature set according to the target object indicated by each object identification.
For any object identifier, the audio feature set comprises a plurality of candidate audio features associated with the object identifier, and the candidate audio features represent a plurality of pronunciation states of the target object.
The target audio feature characterizes the unique audio feature of the corresponding target object. Since the target audio feature serves as the basis for identifying whether the voice information matches the target object, it needs to characterize the target object's audio accurately. However, the target object answers incoming calls, and its pronunciation state can change considerably in some cases, so that a single target audio feature may fail to match the corresponding target object accurately. For example, when a customer service person serving as the target object is excited or low-spirited, their pronunciation state differs from that under a normal mood.
According to the voice information processing method of this embodiment, the candidate audio features covering a plurality of pronunciation states characterize the target object's audio accurately and comprehensively, which can subsequently improve the accuracy of matching the voice information against the target object.
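A sketch of such an audio feature set, assuming one candidate feature per pronunciation state; the state labels are illustrative, not taken from the disclosure:

    import numpy as np

    # object identifier -> {pronunciation state: candidate audio feature}
    AUDIO_FEATURE_SET: dict[str, dict[str, np.ndarray]] = {}

    def register_candidate(object_id: str, state: str, feature: np.ndarray) -> None:
        # e.g. states such as "normal", "excited" or "low", captured at enrollment.
        AUDIO_FEATURE_SET.setdefault(object_id, {})[state] = feature

    def candidate_features(object_id: str) -> list[np.ndarray]:
        # All candidate audio features associated with one object identifier.
        return list(AUDIO_FEATURE_SET[object_id].values())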
Fig. 4 schematically shows a schematic diagram of obtaining a target object detection result of a speech information processing method according to yet another embodiment of the present disclosure.
As shown in fig. 4, the following embodiment may, for example, be used to implement operation S420, i.e., detecting, in the manual question-answering mode, whether the voice information matches the target audio feature to obtain the target object detection result.
In operation S421, based on the voice information 404, the voice feature 406 is determined.
For example, the mel-frequency cepstral coefficients of the voice information may be determined and used as a spectral feature, and the spectral feature may then be processed, for example by convolution, to extract the speech feature.
Mel-Frequency Cepstral Coefficients are abbreviated MFCC. Parameters determined based on MFCCs are robust, fit the auditory characteristics of the human ear well, and retain good recognition performance even when the signal-to-noise ratio decreases.
Illustratively, MFCC extraction may proceed as: pre-emphasis → framing → windowing → fast Fourier transform → triangular band-pass filters (mel filter bank) → computing the logarithmic energy of each filter-bank output → discrete cosine transform to obtain the MFCCs. Pre-emphasis can be achieved by passing the target speech data through a high-pass filter. It boosts the high-frequency part so that the spectrum of the signal becomes flatter and the signal is preserved across the whole band from low to high frequency. Pre-emphasis also compensates for the suppression of the high-frequency part of the speech signal by the vocal cords and lips during phonation, highlighting the high-frequency formants.
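A runnable sketch of this pipeline using librosa, which internally performs the framing, windowing, FFT, mel filter-bank, log-energy and DCT steps; the sample rate, the pre-emphasis coefficient 0.97 and the time-averaging at the end are conventional assumptions, not values fixed by the text:

    import librosa
    import numpy as np

    def extract_speech_feature(path: str, n_mfcc: int = 13) -> np.ndarray:
        y, sr = librosa.load(path, sr=16000)           # mono waveform at 16 kHz
        y = librosa.effects.preemphasis(y, coef=0.97)  # high-pass pre-emphasis
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        # Average over time for a fixed-length spectral feature; a real system
        # would typically feed the frames through a convolutional encoder instead.
        return mfcc.mean(axis=1)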
In operation S422, the similarity between the speech feature and a plurality of candidate audio features of the target object is determined, resulting in a plurality of candidate similarities.
For example, the similarity between the speech feature and each candidate audio feature of the target object may be characterized by parameters such as their cosine similarity or spatial distance, yielding a plurality of candidate similarities.
In operation S423, a target object detection result 407 is determined according to the plurality of candidate similarities.
For example, a target similarity may be preset and the candidate similarities compared with it. If any candidate similarity is greater than or equal to the target similarity, it may be determined that the target object detection result indicates that the voice information matches the target audio feature. If all candidate similarities are smaller than the target similarity, it is determined that the target object detection result indicates that the voice information does not match the target audio feature.
According to the voice information processing method of this embodiment, the similarities between the speech feature of the voice information and the candidate audio features of the target object are determined, and the resulting candidate similarities serve as the reference basis for determining the target object detection result, giving higher accuracy.
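Operations S422 and S423 can be sketched as follows; cosine similarity is one of the measures the text names, and the threshold 0.75 is purely illustrative:

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def detect_target_object(speech_feature: np.ndarray,
                             candidate_feats: list[np.ndarray],
                             target_similarity: float = 0.75) -> bool:
        # One candidate similarity per registered pronunciation state.
        candidate_sims = [cosine_similarity(speech_feature, c) for c in candidate_feats]
        # Matched if any pronunciation state of the target object is close enough.
        return max(candidate_sims) >= target_similarity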
In the example of fig. 4, a specific example is schematically shown in which the service object 401 and the target object 403-S1 communicate in the manual question-answering mode through the call terminal 402-S2 to obtain the voice information 404. In the example of fig. 4, the object identifier 405 of the target object matched with the call terminal 402-S2 is S2; the voice information 404 sent by the call terminal 402-S2 is associated with the object identifier 405, and the M candidate audio features associated with it, candidate audio feature CF1-S2 through candidate audio feature CFM-S2, can be determined from the object identifier 405.
In the example of fig. 4, a speech feature 406 may also be obtained from the voice information 404, and the similarities between the speech feature 406 and the M candidate audio features associated with the object identifier 405 may be calculated to obtain M candidate similarities, candidate similarity FS1 through candidate similarity FSM. Candidate similarity FS1 may, for example, be characterized by the cosine similarity or spatial distance between the speech feature 406 and the corresponding candidate audio feature CF1-S2.
Fig. 5 schematically shows a schematic diagram of a voice information processing method according to still another embodiment of the present disclosure.
As shown in fig. 5, the voice information processing method 500 according to still another embodiment of the present disclosure may further include operation S540.
In operation S540, in the machine question-answering mode, a second target interactive voice 507 for output is determined according to the standard interactive voice 506 corresponding to the machine question-answering mode and the target audio feature.
The machine question-answering mode may be understood as a mode in which call interaction is performed by a machine. For example, in some cases, in addition to manual call interaction, routine question-answer interactions may be handled by, for example, a customer service robot. In the example of fig. 5, a specific example is schematically shown in which the standard interactive speech 506 is determined by an electronic device, computer 503.
In the machine question-answering mode, parameters such as the timbre and pronunciation intervals of the standard interactive speech 506 output by the robot are fixed, so a service object is often psychologically averse to such machine-sounding standard interactive speech.
Illustratively, the standard interactive voice may be processed according to the target audio feature, for example using a speech synthesis model, so that the content of the second target interactive voice is consistent with the standard interactive voice while the second target interactive voice carries the target audio feature.
According to the voice information processing method of this embodiment, in the machine question-answering mode the second target interactive voice determined from the standard interactive voice and the target audio feature carries the target audio feature and restores the audio characteristics of the target object matched with the voice information. Hearing the second target interactive voice improves the call experience of the service object.
Illustratively, operation S540 may be performed, for example, before operation S210 or operation S310.
In the example of fig. 5, a specific example is schematically shown in which the call terminal 502-S2 makes a call with the service object 501 in the machine question-answering mode to obtain the voice information 504. Also schematically shown is a specific example of operation S510: determining the target audio feature F-S2 associated with the object identifier 505 (specifically, S2) associated with the received voice information 504; reference may be made to the description of the foregoing embodiments, which is not repeated here.
According to a voice information processing method of a further embodiment of the present disclosure, the machine question-and-answer mode may include a first machine question-and-answer stage and a second machine question-and-answer stage, and the standard interactive voice includes a first standard interactive voice and a second standard interactive voice. The first standard interactive voice corresponding to the first machine question-answering stage can be executed by the robot, and the second standard interactive voice corresponding to the second machine question-answering stage can be executed by the robot or a target object indicated by the object identification. The historical response frequency of the first standard interactive voice is higher than that of the second standard interactive voice.
To relieve the pressure on manual conversation, standard interactive speech may be executed by the robot; this is well suited to business scenarios that mostly involve responses to routine questions. Other business questions are answered manually, but in practice there are cases where a large number of such business questions cannot all be answered manually.
Illustratively, historical incoming-call records may be analyzed to determine the business questions that are most frequently answered manually. For each such frequently answered business question, a corresponding standard interactive voice, i.e., a second standard interactive voice corresponding to the second machine question-answering stage, is determined. When a large number of subsequent incoming calls concern a frequently answered business question, the robot can execute the second standard interactive voice to answer, relieving the pressure on manual conversation.
In the case that an incoming call about such a business question can be answered manually, the second standard interactive voice may also be executed manually, giving greater flexibility.
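A sketch of mining the historical incoming-call records for frequently answered business questions, from which the second-stage standard interactive voices would then be prepared; the record format and the cutoff are assumptions:

    from collections import Counter

    def frequent_questions(history: list[str], min_count: int = 50) -> list[str]:
        # history holds one manually answered business question per record.
        counts = Counter(history)
        return [q for q, n in counts.most_common() if n >= min_count]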
Fig. 6 schematically shows a schematic diagram of a voice information processing method according to still another embodiment of the present disclosure.
As shown in fig. 6, the voice information processing method 600 according to still another embodiment of the present disclosure may include, for example, operations S650 to S680.
In operation S650, in response to the question-answering mode adjustment instruction 601, the machine question-answering mode M1 corresponding to the current object identifier is switched to the manual question-answering mode M2.
Illustratively, the question-answering mode adjustment instruction may, for example, be issued by the target object. According to the voice information processing method of this embodiment, the question-answering mode can be adjusted, flexibly adapting to different scenarios.
For example, a target object corresponds to call terminal P1 and call terminal P2. When both terminals receive incoming calls, the target object may answer call L1 on terminal P1 while call L2 on terminal P2 is handled by the robot in the machine question-answering mode; when the currently answered call L1 ends, the machine question-answering mode for call L2 can be switched to the manual question-answering mode.
In operation S660, the standard interactive voice 602 of the machine question answering mode M1 and the answering content 603 corresponding to the standard interactive voice 602 are determined.
It can be understood that in the machine question-answering mode the call interaction involves both the robot and the service object: the standard interactive voice 602 is executed by the robot, the service object makes the corresponding response interactions, and the response content corresponding to the standard interactive voice can be determined from those responses, yielding a complete record of the call interaction. This complete record reflects the intention of the service object and serves the purpose of the call.
In operation S670, keywords are extracted from the response content 603, resulting in the key content 604.
In operation S680, in the case where the machine question-and-answer mode M1 is switched to the manual question-and-answer mode M2, the key content 604 is fed back to the target object 605.
In a practical application scenario, when an incoming call is switched from the machine question-answering mode to the manual question-answering mode, the target object does not know the specifics of the call interaction that took place with the service object in the machine question-answering mode.
According to the voice information processing method of the embodiments of the present disclosure, when the machine question-answering mode is switched to the manual question-answering mode, the specifics of the call interaction for the current incoming call can be obtained by determining the standard interactive voice and the corresponding response content from the machine question-answering mode. By extracting keywords from the response content and feeding the resulting key content back to the target object, the target object obtains the key information of the call so far, enabling efficient, quick and smooth call interaction with the service object in the manual question-answering mode.
Illustratively, the key content may, for example, be sent to the client corresponding to the target object and displayed there, which is one specific way of feeding the key content back to the target object.
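Operation S670 could, for instance, use an off-the-shelf TF-IDF keyword extractor such as jieba's, which suits the Chinese transcripts of this scenario; the topK value is an illustrative choice:

    import jieba.analyse

    def extract_key_content(response_content: str, top_k: int = 5) -> list[str]:
        # TF-IDF keyword extraction over the service object's response content.
        return jieba.analyse.extract_tags(response_content, topK=top_k)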
The voice information processing method according to still another embodiment of the present disclosure may further include, for example: determining an evaluation value of the key content according to the evaluation parameter; and determining the question-answering mode switching priority of the corresponding machine question-answering mode according to the evaluation value of the key content.
The question-answering mode switching priority represents the priority of switching from the machine question-answering mode to the manual question-answering mode.
The evaluation parameters include the ratio of the key content to the full set of standard interactive speech.
This ratio can represent how far the robot has progressed through the call interaction in the current machine question-answering mode.
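One way to turn this ratio into a switching priority is sketched below. The text does not fix whether calls with more or with less machine-mode progress should be switched to manual answering first, so the ascending order used here (least progress first) is an assumption:

    def evaluation_value(num_key_items: int, num_standard_utterances: int) -> float:
        # Ratio of key content obtained so far to the full set of standard
        # interactive speech for this call.
        return num_key_items / max(num_standard_utterances, 1)

    def switching_order(calls: dict[str, tuple[int, int]]) -> list[str]:
        # calls: call id -> (key items so far, total standard utterances).
        # Ascending evaluation value = higher switching priority (assumption).
        return sorted(calls, key=lambda cid: evaluation_value(*calls[cid]))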
Illustratively, incoming-call answering efficiency can be improved by having one target object correspond to a plurality of call terminals.
When a plurality of call terminals receive incoming calls at the same time, the target object can decide to answer one of them manually while the others are set, for example, to the machine question-answering mode; when the current call ends, the call most in need of manual answering can be determined from the other calls in the machine question-answering mode.
The voice information processing method of the embodiments of the present disclosure is suited to this scenario: based on the evaluation parameters, the evaluation value determined for the key content can be used to assess how necessary it is to switch a call out of the machine question-answering mode, and the evaluation value can thus determine the question-answering mode switching priority of the corresponding machine question-answering mode.
The target object can then selectively answer the several incoming calls in the machine question-answering mode according to the question-answering mode switching priority, which yields higher call efficiency.
Fig. 7 schematically shows a block diagram of a speech information processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the speech information processing apparatus 700 of the embodiment of the present disclosure includes, for example, a target audio feature determination module 710 and a target object detection result determination module 720.
And the target audio characteristic determining module 710 is configured to determine, according to the object identifier associated with the received voice information, a target audio characteristic associated with the object identifier.
The target audio feature characterizes an audio feature of the target object indicated by the object identifier.
And the target object detection result determining module 720 is configured to detect whether the voice information matches with the target audio features in the manual question answering mode, so as to obtain a target object detection result.
The voice information processing apparatus according to the embodiment of the present disclosure further includes: and the first target interactive voice determining module is used for determining the first target interactive voice for output according to the voice information and the target audio characteristics under the condition that the target object detection result represents that the voice information is not matched with the target audio characteristics.
The voice information processing apparatus according to the embodiment of the present disclosure further includes: and the audio feature set determining module is used for determining an audio feature set according to the target object indicated by each object identifier, wherein the audio feature set comprises a plurality of candidate audio features associated with the object identifiers aiming at any object identifier, and the candidate audio features represent a plurality of pronunciation states of the target object.
According to the embodiment of the present disclosure, the target object detection result determination module includes: the device comprises a voice feature determining submodule, a candidate similarity determining submodule and a target object detection result determining submodule.
And the voice characteristic determining submodule is used for determining the voice characteristics according to the voice information.
And the candidate similarity determining submodule is used for determining the similarity between the voice characteristic and a plurality of candidate audio characteristics of the target object to obtain a plurality of candidate similarities.
And the target object detection result determining submodule is used for determining a target object detection result according to the candidate similarities.
The voice information processing apparatus according to the embodiment of the present disclosure further includes: and the second target interactive voice determining module is used for determining a second target interactive voice for output according to the standard interactive voice and the target audio characteristics corresponding to the machine question-answering mode in the machine question-answering mode.
According to the voice information processing device disclosed by the embodiment of the disclosure, the machine question-answering mode comprises a first machine question-answering stage and a second machine question-answering stage, and the standard interactive voice comprises a first standard interactive voice and a second standard interactive voice; the first standard interactive voice corresponding to the first machine question-answering stage is executed by the robot, and the second standard interactive voice corresponding to the second machine question-answering stage is executed by the robot or a target object indicated by the object identification; the historical response frequency of the first standard interactive voice is higher than that of the second standard interactive voice.
The voice information processing apparatus according to the embodiment of the present disclosure further includes: the system comprises a question-answer mode switching module, an answer content determining module, a key content determining module and a key content feedback module.
And the question-answer mode switching module is used for responding to the question-answer mode adjusting instruction and switching the machine question-answer mode corresponding to the current object identifier into an artificial question-answer mode.
And the response content determining module is used for determining the standard interactive voice of the machine question-answering mode and the response content corresponding to the standard interactive voice.
And the key content determining module is used for extracting the key words from the response content to obtain the key content.
And the key content feedback module is used for feeding back key contents to the target object under the condition that the machine question-answering mode is switched to the manual question-answering mode.
The voice information processing apparatus according to the embodiment of the present disclosure further includes: an evaluation value determining module and a question-answering mode switching priority determining module.
And the evaluation value determining module is used for determining the evaluation value of the key content according to the evaluation parameter.
The evaluation parameters include the ratio of the key content to the full set of standard interactive speech.
And the question-answer mode switching priority determining module is used for determining the question-answer mode switching priority of the corresponding machine question-answer mode according to the evaluation value of the key content.
The question-answering mode switching priority represents the priority of switching from the machine question-answering mode to the manual question-answering mode.
It should be understood that the embodiments of the apparatus part of the present disclosure are the same as or similar to the embodiments of the method part of the present disclosure, and the technical problems to be solved and the technical effects to be achieved are also the same as or similar to each other, and the detailed description of the present disclosure is omitted.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The calculation unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The calculation unit 801 executes the respective methods and processes described above, such as a voice information processing method. For example, in some embodiments, the speech information processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the speech information processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the voice information processing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A voice information processing method, comprising:
determining, according to an object identifier associated with received voice information, a target audio feature associated with the object identifier, wherein the target audio feature represents an audio feature of a target object indicated by the object identifier; and
in a manual question-answering mode, detecting whether the voice information matches the target audio feature to obtain a target object detection result.
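By way of illustration only, and not as part of the claimed subject matter, the following minimal Python sketch shows the flow of claim 1: look up a target audio feature by object identifier, then test whether the received voice information matches it. The enrollment store, the embedding size, the feature extractor, and the 0.7 threshold are all assumptions of the sketch.

```python
import numpy as np

# Hypothetical enrollment store mapping an object identifier to a voiceprint
# embedding; the name, the 128-dim size, and the threshold below are
# illustrative assumptions, not part of the disclosure.
ENROLLED_FEATURES = {
    "caller_001": np.random.default_rng(0).standard_normal(128),
}

def extract_voice_feature(voice_info: np.ndarray) -> np.ndarray:
    # Stand-in for a real voiceprint extractor (e.g., a deep speaker encoder);
    # here we simply L2-normalize the raw signal vector.
    return voice_info / (np.linalg.norm(voice_info) + 1e-9)

def detect_target_object(voice_info: np.ndarray, object_id: str,
                         threshold: float = 0.7) -> bool:
    # Determine the target audio feature associated with the object identifier,
    # then test whether the received voice information matches it.
    target = ENROLLED_FEATURES[object_id]
    target = target / (np.linalg.norm(target) + 1e-9)
    similarity = float(extract_voice_feature(voice_info) @ target)
    return similarity >= threshold  # True: the speaker is the target object
```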
2. The method of claim 1, further comprising:
in a case where the target object detection result indicates that the voice information does not match the target audio feature, determining, according to the voice information and the target audio feature, a first target interactive voice for output.
3. The method of claim 1, further comprising:
determining an audio feature set according to the target object indicated by each object identifier, wherein, for any one object identifier, the audio feature set comprises a plurality of candidate audio features associated with the object identifier, and the candidate audio features represent a plurality of pronunciation states of the target object.
4. The method of claim 3, wherein the detecting, in the manual question-answering mode, whether the voice information matches the target audio feature to obtain the target object detection result comprises:
determining a voice feature according to the voice information;
determining similarities between the voice feature and the plurality of candidate audio features of the target object to obtain a plurality of candidate similarities; and
determining the target object detection result according to the plurality of candidate similarities.
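Claims 3 and 4 leave open how the plurality of candidate similarities is aggregated into a detection result. A minimal sketch of one plausible reading follows, assuming cosine similarity per pronunciation state and a maximum-over-candidates decision rule; both choices are assumptions, not disclosed requirements.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def detect_from_candidates(voice_feature: np.ndarray,
                           candidate_features: list[np.ndarray],
                           threshold: float = 0.7) -> bool:
    # One candidate audio feature per pronunciation state of the target object
    # (e.g., normal speech, a hoarse voice, fast speech), per claim 3.
    if not candidate_features:
        return False
    candidate_similarities = [cosine(voice_feature, c)
                              for c in candidate_features]
    # Aggregation rule is an assumption: the target object is detected if any
    # single pronunciation state is similar enough.
    return max(candidate_similarities) >= threshold
```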
5. The method of any of claims 1-4, further comprising:
in a machine question-answering mode, determining, according to a standard interactive voice corresponding to the machine question-answering mode and the target audio feature, a second target interactive voice for output.
6. The method of claim 5, wherein the machine question-answering mode comprises a first machine question-answering phase and a second machine question-answering phase, and the standard interactive voice comprises a first standard interactive voice and a second standard interactive voice; the first standard interactive voice corresponding to the first machine question-answering phase is executed by a robot, and the second standard interactive voice corresponding to the second machine question-answering phase is executed by the robot or by the target object indicated by the object identifier; and a historical reply frequency of the first standard interactive voice is higher than a historical reply frequency of the second standard interactive voice.
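For illustration, a short sketch of the ordering constraint in claim 6, assuming each standard interactive voice carries a historical reply frequency and that prompts are simply scheduled in descending order of that frequency; the data structure and the scheduler are assumptions of the sketch.

```python
from dataclasses import dataclass

@dataclass
class StandardVoice:
    text: str
    historical_reply_frequency: float  # share of past sessions that replied

def schedule_phases(prompts: list[StandardVoice]) -> list[StandardVoice]:
    # Play the prompts callers historically reply to most often first, so the
    # first machine question-answering phase gets the higher-frequency prompt,
    # matching claim 6's constraint; this scheduler itself is an assumption.
    return sorted(prompts, key=lambda p: p.historical_reply_frequency,
                  reverse=True)
```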
7. The method of claim 5, further comprising:
in response to a question-answering mode adjustment instruction, switching the machine question-answering mode corresponding to a current object identifier to the manual question-answering mode;
determining the standard interactive voice of the machine question-answering mode and response content corresponding to the standard interactive voice;
extracting keywords from the response content to obtain key content; and
in a case where the machine question-answering mode is switched to the manual question-answering mode, feeding back the key content to the target object.
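Claim 7 does not fix the keyword-extraction algorithm. As a hedged illustration, the sketch below uses a toy fixed-vocabulary filter to derive the key content fed back on the machine-to-manual switch; the vocabulary and the naive sentence splitting are placeholders, and a production system might instead use TF-IDF or a learned extractor.

```python
def extract_key_content(response_content: str, keywords: set[str]) -> list[str]:
    # Keep each sentence of the machine-mode response content that mentions a
    # keyword, so the relevant context can be skimmed on takeover.
    sentences = [s.strip() for s in response_content.split(".") if s.strip()]
    return [s for s in sentences if any(k in s.lower() for k in keywords)]

def on_switch_to_manual(transcript: str) -> list[str]:
    # Key content produced on the machine-to-manual switch; the keyword
    # vocabulary here is a placeholder assumption.
    return extract_key_content(transcript, {"account", "refund", "error"})

# Example usage:
# on_switch_to_manual("My account was charged twice. The weather is nice.")
# -> ["My account was charged twice"]
```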
8. The method of claim 7, further comprising:
determining an evaluation value of the key content according to an evaluation parameter, wherein the evaluation parameter comprises a ratio of the key content to the full standard interactive voice; and
determining, according to the evaluation value of the key content, a question-answering mode switching priority of the corresponding machine question-answering mode, wherein the question-answering mode switching priority represents a priority of switching the machine question-answering mode to the manual question-answering mode.
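A worked example of claim 8's arithmetic, assuming the evaluation parameter is realized as a length ratio and that a higher evaluation value means a higher switching priority; both assumptions go beyond what the claim specifies.

```python
def evaluation_value(key_content_len: int, full_voice_len: int) -> float:
    # Claim 8's evaluation parameter includes the ratio of key content to the
    # full standard interactive voice; taking the ratio over lengths is an
    # assumption about how that parameter is realized.
    return key_content_len / full_voice_len if full_voice_len else 0.0

def switching_order(sessions: dict[str, float]) -> list[str]:
    # Higher evaluation value -> higher priority to switch to manual mode
    # (the direction of the ordering is an assumption).
    return sorted(sessions, key=sessions.get, reverse=True)

# Example: session_a (ratio 0.30) outranks session_b (ratio 0.10).
order = switching_order({
    "session_a": evaluation_value(12, 40),
    "session_b": evaluation_value(5, 50),
})
assert order == ["session_a", "session_b"]
```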
9. A voice information processing apparatus, comprising:
a target audio feature determination module, configured to determine, according to an object identifier associated with received voice information, a target audio feature associated with the object identifier, wherein the target audio feature represents an audio feature of a target object indicated by the object identifier; and
a target object detection result determination module, configured to detect, in a manual question-answering mode, whether the voice information matches the target audio feature to obtain a target object detection result.
10. The apparatus of claim 9, further comprising:
a first target interactive voice determination module, configured to determine, in a case where the target object detection result indicates that the voice information does not match the target audio feature, a first target interactive voice for output according to the voice information and the target audio feature.
11. The apparatus of claim 9, further comprising:
an audio feature set determination module, configured to determine an audio feature set according to the target object indicated by each object identifier, wherein, for any one object identifier, the audio feature set comprises a plurality of candidate audio features associated with the object identifier, and the candidate audio features represent a plurality of pronunciation states of the target object.
12. The apparatus of claim 11, wherein the target object detection result determination module comprises:
a voice feature determination submodule, configured to determine a voice feature according to the voice information;
a candidate similarity determination submodule, configured to determine similarities between the voice feature and the plurality of candidate audio features of the target object to obtain a plurality of candidate similarities; and
a target object detection result determination submodule, configured to determine the target object detection result according to the plurality of candidate similarities.
13. The apparatus of any of claims 9-12, further comprising:
a second target interactive voice determination module, configured to determine, in a machine question-answering mode, a second target interactive voice for output according to a standard interactive voice corresponding to the machine question-answering mode and the target audio feature.
14. The apparatus of claim 13, wherein the machine question-answering mode comprises a first machine question-answering phase and a second machine question-answering phase, and the standard interactive voice comprises a first standard interactive voice and a second standard interactive voice; the first standard interactive voice corresponding to the first machine question-answering phase is executed by a robot, and the second standard interactive voice corresponding to the second machine question-answering phase is executed by the robot or by the target object indicated by the object identifier; and a historical reply frequency of the first standard interactive voice is higher than a historical reply frequency of the second standard interactive voice.
15. The apparatus of claim 13, further comprising:
a question-answering mode switching module, configured to switch, in response to a question-answering mode adjustment instruction, the machine question-answering mode corresponding to a current object identifier to the manual question-answering mode;
a response content determination module, configured to determine the standard interactive voice of the machine question-answering mode and response content corresponding to the standard interactive voice;
a key content determination module, configured to extract keywords from the response content to obtain key content; and
a key content feedback module, configured to feed back the key content to the target object in a case where the machine question-answering mode is switched to the manual question-answering mode.
16. The apparatus of claim 15, further comprising:
an evaluation value determination module, configured to determine an evaluation value of the key content according to an evaluation parameter, wherein the evaluation parameter comprises a ratio of the key content to the full standard interactive voice; and
a question-answering mode switching priority determination module, configured to determine, according to the evaluation value of the key content, a question-answering mode switching priority of the corresponding machine question-answering mode, wherein the question-answering mode switching priority represents a priority of switching the machine question-answering mode to the manual question-answering mode.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-8.
CN202211276151.0A 2022-10-17 2022-10-17 Voice information processing method, apparatus, device, storage medium and program product Pending CN115641875A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211276151.0A CN115641875A (en) 2022-10-17 2022-10-17 Voice information processing method, apparatus, device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211276151.0A CN115641875A (en) 2022-10-17 2022-10-17 Voice information processing method, apparatus, device, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115641875A true CN115641875A (en) 2023-01-24

Family

ID=84945216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211276151.0A Pending CN115641875A (en) 2022-10-17 2022-10-17 Voice information processing method, apparatus, device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115641875A (en)

Similar Documents

Publication Publication Date Title
CN107818798A (en) Customer service quality evaluating method, device, equipment and storage medium
US9412371B2 (en) Visualization interface of continuous waveform multi-speaker identification
WO2014069076A1 (en) Conversation analysis device and conversation analysis method
WO2014069120A1 (en) Analysis object determination device and analysis object determination method
CN113921022B (en) Audio signal separation method, device, storage medium and electronic equipment
WO2020006558A1 (en) System and method for generating dialogue graphs
CN113257283B (en) Audio signal processing method and device, electronic equipment and storage medium
JP2016161823A (en) Acoustic model learning support device and acoustic model learning support method
JP2007074175A (en) Telephone service inspection system and program thereof
CN111341333B (en) Noise detection method, noise detection device, medium, and electronic apparatus
CN114663556A (en) Data interaction method, device, equipment, storage medium and program product
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
JP2016062333A (en) Retrieval server and retrieval method
WO2015019662A1 (en) Analysis subject determination device and analysis subject determination method
US11743384B2 (en) Real-time agent assistance using real-time automatic speech recognition and behavioral metrics
CN115641875A (en) Voice information processing method, apparatus, device, storage medium and program product
CN107608718B (en) Information processing method and device
US11783840B2 (en) Video conference verbal junction identification via NLP
CN110351435A (en) Blacklist setting method of conversing and device
CN113763968B (en) Method, apparatus, device, medium, and product for recognizing speech
CN110263135A (en) A kind of data exchange matching process, device, medium and electronic equipment
KR102583434B1 (en) Method and system for evaluating quality of voice counseling
CN115118820A (en) Call processing method and device, computer equipment and storage medium
CN114220430A (en) Multi-sound-zone voice interaction method, device, equipment and storage medium
CN113312928A (en) Text translation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination