CN111354350B - Voice processing method and device, voice processing equipment and electronic equipment - Google Patents


Info

Publication number
CN111354350B
CN111354350B (application CN201911371210.0A)
Authority
CN
China
Prior art keywords
voice
candidate
keywords
clients
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911371210.0A
Other languages
Chinese (zh)
Other versions
CN111354350A (en)
Inventor
袁全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201911371210.0A
Publication of CN111354350A
Application granted
Publication of CN111354350B
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/95 - Retrieval from the web
    • G06F 16/953 - Querying, e.g. by the use of web search engines
    • G06F 16/9535 - Search customisation based on user profiles and personalisation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/50 - Network services
    • H04L 67/55 - Push-based network services
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 2015/088 - Word spotting

Abstract

Embodiments of the application provide a voice processing method and apparatus, a voice processing device, and an electronic device, wherein the method comprises the following steps: determining keywords in the acquired voice information; searching for the target server corresponding to the keywords among a plurality of candidate servers; and acquiring the recommended content, corresponding to the voice information, fed back by the target server, so as to output the recommended content to the user. By integrating multiple servers, the embodiments of the application broaden the range of services the electronic device can access and improve its utilization efficiency.

Description

Voice processing method and device, voice processing equipment and electronic equipment
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a voice processing method and apparatus, a voice processing device, and an electronic device.
Background
Artificial intelligence technology covers the theories, methods, techniques, and application systems used to simulate, extend, and expand human intelligence. Electronic devices such as smartphones, smart home appliances, and smart speakers are increasingly widely used; they can interact with users and carry out the corresponding intelligent processing according to the instructions users issue.
In the prior art, a voice client on an electronic device such as a smart speaker or smartphone can collect a user's voice information. The voice client sends the collected voice information to its corresponding server; the server recognizes the user's voice information, obtains the recommended content corresponding to it, and sends that content back to the voice client of the electronic device, which then outputs it.
However, the server corresponding to each voice client is fixed: the electronic device can only obtain the recommended content corresponding to the voice clients installed on it, and cannot obtain recommended content from the servers corresponding to other voice clients, so the utilization rate of the electronic device is low.
Disclosure of Invention
Embodiments of the application provide a voice processing method and apparatus, a voice processing device, and an electronic device, to solve the prior-art problem that an electronic device can obtain recommended content only from the server corresponding to a voice client installed on it, which keeps the utilization rate of the electronic device low.
Thus, in one embodiment of the present application, there is provided a speech processing method comprising:
determining keywords in the acquired voice information;
searching for the target server corresponding to the keywords among a plurality of candidate servers;
and acquiring the recommended content, corresponding to the voice information, fed back by the target server, so as to output the recommended content to the user.
In another embodiment of the present application, there is provided a voice processing method, including:
determining keywords in the acquired voice information;
searching for the target client corresponding to the keywords among a plurality of candidate clients;
and acquiring the recommended content determined by the target client based on the voice information, so as to output the recommended content to the user.
In yet another embodiment of the present application, a voice processing method is provided, applied to an electronic device, including:
collecting voice information of a user;
sending the voice information to a central server, wherein the central server determines keywords in the voice information, and the keywords are used to search for the corresponding target server among a plurality of candidate servers; the target server feeds back recommended content corresponding to the voice information to the central server; and the central server feeds the recommended content back to the electronic device;
and acquiring the recommended content fed back by the central server, so as to output the recommended content to the user.
In yet another embodiment of the present application, there is provided a voice processing apparatus including:
the first determining module is used for determining keywords in the acquired voice information;
the first searching module is used for searching a target server corresponding to the keyword from a plurality of candidate servers;
the first processing module is used for acquiring recommended content corresponding to the voice information fed back by the target server side so as to output the recommended content for a user.
In yet another embodiment of the present application, there is provided a voice processing apparatus including:
the second determining module is used for determining keywords in the acquired voice information;
the second searching module is used for searching target clients corresponding to the keywords from the plurality of candidate clients;
and the second processing module is used for acquiring the recommended content determined by the target client based on the voice information so as to output the recommended content for the user.
In still another embodiment of the present application, there is provided a voice processing apparatus configured in an electronic device, including:
the voice acquisition module, used for acquiring voice information of a user;
the voice sending module, used for sending the voice information to a central server, wherein the central server determines keywords in the voice information, and the keywords are used to search for the corresponding target server among a plurality of candidate servers; the target server feeds back recommended content corresponding to the voice information to the central server; and the central server feeds the recommended content back to the electronic device;
and the third processing module, used for acquiring the recommended content fed back by the central server, so as to output the recommended content to the user.
In yet another embodiment of the present application, there is provided a voice processing apparatus including: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing assembly is configured to:
determining keywords in the acquired voice information; searching a target server corresponding to the keyword from a plurality of candidate servers; and acquiring recommended content corresponding to the voice information fed back by the target server side, so as to output the recommended content for a user.
In yet another embodiment of the present application, there is provided a voice processing apparatus including: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing assembly is configured to:
determining keywords in the acquired voice information; searching target clients corresponding to the keywords from the plurality of candidate clients; and acquiring recommended content determined by the target client based on the voice information, so as to output the recommended content for a user.
In yet another embodiment of the present application, there is provided an electronic device including: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing assembly is configured to:
collecting voice information of a user; sending the voice information to a central server, wherein the central server determines keywords in the voice information, and the keywords are used to search for the corresponding target server among a plurality of candidate servers; the target server feeds back recommended content corresponding to the voice information to the central server; the central server feeds the recommended content back to the electronic device; and acquiring the recommended content fed back by the central server, so as to output the recommended content to the user.
According to the technical solutions provided by the embodiments of the application, keywords in the acquired voice information can be determined, and the target server corresponding to the keywords can be found among a plurality of candidate servers, so that the recommended content the target server feeds back for the voice information can be acquired and output to the user. By providing this selection among multiple candidate servers, several candidate servers can simultaneously serve the recommendation work related to the user's voice information, which expands the feedback range of the voice service and improves the utilization efficiency of the electronic device.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 illustrates a flow chart of one embodiment of a speech processing method provided herein;
FIG. 2 is a flow chart illustrating yet another embodiment of a speech processing method provided herein;
FIG. 3 is a flow chart illustrating yet another embodiment of a speech processing method provided herein;
FIG. 4 is a flow chart illustrating yet another embodiment of a speech processing method provided herein;
FIG. 5 is a flow chart illustrating yet another embodiment of a speech processing method provided herein;
FIG. 6 illustrates an example diagram of a speech processing method provided herein;
FIG. 7 is a flow chart illustrating yet another embodiment of a speech processing method provided herein;
FIG. 8 is a flow chart illustrating yet another embodiment of a speech processing method provided herein;
FIG. 9 is a schematic diagram illustrating one embodiment of a speech processing apparatus provided herein;
FIG. 10 is a schematic diagram illustrating one embodiment of a speech processing device provided herein;
FIG. 11 is a schematic diagram of a speech processing device according to another embodiment of the present application;
FIG. 12 is a schematic diagram of a further embodiment of a speech processing device provided herein;
FIG. 13 is a schematic diagram of a speech processing device according to another embodiment of the present application;
Fig. 14 shows a schematic structural diagram of still another embodiment of an electronic device provided in the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application.
Some of the flows described in the specification, claims, and drawings of this application include operations that occur in a particular order, but it should be understood that these operations may be executed out of the order in which they appear herein, or in parallel. Sequence numbers such as 101 and 102 merely distinguish the operations and do not by themselves represent any required order of execution. In addition, the flows may include more or fewer operations, which may be executed sequentially or in parallel. It should also be noted that the terms "first" and "second" herein distinguish different messages, devices, modules, and the like; they do not represent a sequence, nor do they require that the "first" and "second" items be of different types.
The method and apparatus can be applied to intelligent voice-interaction scenarios: multiple candidate servers are aggregated for the electronic device, giving the user seamless switching among the candidate servers, so that the user can enjoy the differentiated services of different conversational systems and the utilization efficiency of the electronic device is improved.
As described in the background section, an electronic device such as a smart speaker or smartphone typically has candidate clients installed. A candidate client communicates with its corresponding candidate server: it collects the voice information the user utters and sends it to the candidate server, which recognizes the voice information and obtains the corresponding recommended content. The server feeds the recommended content back to the candidate client, which outputs it. However, the candidate server a candidate client corresponds to is typically fixed; for example, a Tmall Genie client typically accesses only the Tmall Genie backend system, which leaves the electronic device underutilized.
In the embodiments of the application, a plurality of candidate servers are integrated. After the keywords of the acquired voice information are determined, the target server corresponding to the keywords can be found among the candidate servers; the target server then obtains the voice information, searches for the recommended content corresponding to it, and feeds that content back to the electronic device, which outputs it to the user. Because the candidate servers are integrated, once the electronic device has collected voice information it can evaluate the candidates to determine the target server, and it can access any one of them based on the user's voice information. This enables seamless switching among the candidate systems, lets one electronic device work with multiple backend systems, and improves its utilization rate.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of one embodiment of a voice processing method according to an embodiment of the present application may include:
101: and determining keywords in the acquired voice information.
The voice processing method provided by this embodiment can be applied to an electronic device that interacts directly with the user: when the user wants to carry out intelligent voice interaction, recommended content can be obtained through the electronic device. The electronic device may be a mobile phone, a smart speaker, a tablet computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR) or Virtual Reality (VR) device, or the like; the embodiments of the application do not limit its specific type.
The voice processing method can also be applied to a central server that interacts with the electronic device; the user interacts with the central server through the electronic device. The central server may be a high-performance computer, a cloud computing device, or the like; this embodiment does not limit its specific type. The central server may correspond to a plurality of candidate servers and interact with each of them.
Keywords in the voice information may refer to words in it that relate to a service system, such as the names of a system, a device, a developer, a client, or a server.
Taking a smart speaker as the electronic device: in this embodiment the collected voice information may be, for example, "Tmall Genie, I want to listen to a song", from which the keywords "Tmall Genie" and "song" can be determined.
102: searching a target server corresponding to the keyword from a plurality of candidate servers.
Optionally, system keywords may be set for each candidate server. Searching for the target server corresponding to the keyword among the plurality of candidate servers may then include: searching the system keywords of the candidate servers for the one containing the keyword, and taking the corresponding candidate server as the target. Each candidate server may have several system keywords, each of which can be used to distinguish that candidate server.
Optionally, the target server corresponding to the keyword may be searched for directly among the plurality of candidate servers. In this case the electronic device or the central server communicates directly with the candidate servers, so that the recommended content corresponding to the voice information can be obtained from the matching candidate server.
The electronic equipment or the central server can send the voice information to the target server, so that the target server can acquire the voice information and determine recommended content corresponding to the voice information. After the target server obtains the recommended content, the recommended content may be sent to the electronic device or the central server.
The candidate servers may be voice servers, and the candidate clients voice clients: multiple voice servers can be integrated, and the target server corresponding to the keywords is found among them. In one possible design, when the electronic device is a smart speaker providing voice-interaction services, the speaker can integrate the functions of multiple voice servers in order to expand its application range, and select the appropriate target among them according to the keywords, so as to obtain recommended content from the target server. For example, when the keywords of the voice information are "Tmall Genie" and the name of some requested content, the "Tmall Genie server" can be selected as the target from among the candidate voice servers, and the corresponding recommended content obtained from it.
The smart speaker can thus switch to different voice servers at any time to obtain the corresponding services, which expands its application range and improves its utilization efficiency.
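The keyword-based server selection described above amounts to a lookup against each candidate's registered keywords. The following is an illustrative sketch, not the patented implementation; the server names and keyword sets are hypothetical:

```python
def find_target_server(keywords, candidate_servers):
    """Return the first candidate server whose registered system
    keywords intersect the keywords extracted from the utterance."""
    for server_name, system_keywords in candidate_servers.items():
        if system_keywords & set(keywords):
            return server_name
    return None  # no candidate matched; a caller might fall back to a default

# Hypothetical registry: candidate server -> its system keywords.
candidates = {
    "tmall-genie-server": {"tmall genie", "song", "music"},
    "other-voice-server": {"weather", "news"},
}

# Keywords extracted from "Tmall Genie, I want to listen to a song".
target = find_target_server(["tmall genie", "song"], candidates)
```

In this sketch an utterance mentioning "Tmall Genie" or "song" is routed to the "Tmall Genie" backend, while a weather query would reach the other candidate, mirroring the switching behavior described above.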
103: and acquiring recommended content corresponding to the voice information fed back by the target server side, so as to output the recommended content for a user.
Optionally, the recommended content sent by the target server may be received. When the recommended content is output to the user, it can be shown on a display screen or played back as speech. Taking the smart speaker as an example, when it plays the recommended content for the user, it can do so by voice.
In the embodiments of the application, a plurality of candidate servers are integrated. After the keywords of the acquired voice information are determined, the target server corresponding to the keywords can be found among the candidate servers; the target server then obtains the voice information, searches for the recommended content corresponding to it, and feeds that content back to the electronic device, which outputs it to the user. Because the candidate servers are integrated, once the electronic device has collected voice information it can evaluate the candidates to determine the target server, and it can access any one of them based on the user's voice information. This enables seamless switching among the candidate systems, lets one electronic device work with multiple backend systems, and improves its utilization rate.
The target server side can receive voice information sent by the electronic equipment. In some embodiments, after searching the target server corresponding to the keyword from the plurality of candidate servers, the method further includes:
and sending the voice information to the target server.
The voice information is used by the target server to determine the corresponding recommended content.
After receiving voice information sent by the electronic equipment, the target server can determine recommended content corresponding to the voice information. The target server side can feed back recommended content to the electronic equipment.
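The forward-and-feed-back exchange above can be sketched with an in-memory stand-in for the target server; the class, its `recommend` method, and the catalog contents are all hypothetical, not part of the patent:

```python
class CandidateServer:
    """Minimal stand-in for a candidate voice server: it maps an
    utterance to recommended content. Purely illustrative."""

    def __init__(self, name, catalog):
        self.name = name
        self.catalog = catalog  # hypothetical keyword -> content mapping

    def recommend(self, voice_text):
        # A real server would run its own recognition and retrieval;
        # here we simply match catalog keys against the utterance.
        for key, content in self.catalog.items():
            if key in voice_text:
                return content
        return "Sorry, nothing found."


def dispatch(voice_text, target):
    """Forward the voice information to the target server and return
    the recommended content it feeds back."""
    return target.recommend(voice_text)


music_server = CandidateServer("music", {"song": "Playing a popular song."})
reply = dispatch("i want to listen to a song", music_server)
```

In a real deployment the dispatch step would be a network call from the electronic device or central server to the target backend; the in-memory object only illustrates the request/feedback shape.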
As shown in fig. 2, a flowchart of another embodiment of a voice processing method according to an embodiment of the present application may include the following steps:
201: and collecting voice information of the user.
The voice processing method provided by the embodiment of the application can be applied to electronic equipment.
The electronic device may be provided with a voice acquisition component, for example, a microphone, through which voice information of the user may be acquired.
202: and determining keywords in the acquired voice information.
Keywords in the voice information can be obtained through text conversion and semantic recognition processing.
203: searching a target server corresponding to the keyword from a plurality of candidate servers.
204: and acquiring recommended content corresponding to the voice information fed back by the target server side, so as to output the recommended content for a user.
Some steps of the embodiments of the present application are the same as those of the previous embodiments, and are not repeated here.
In the embodiments of the application, a plurality of candidate servers are integrated. After the keywords of the acquired voice information are determined, the target server corresponding to the keywords can be found among the candidate servers; the target server then obtains the voice information, searches for the recommended content corresponding to it, and feeds that content back to the electronic device, which outputs it to the user. Because the candidate servers are integrated, once the electronic device has collected voice information it can evaluate the candidates to determine the target server, and it can access any one of them based on the user's voice information. This enables seamless switching among the candidate systems, lets one electronic device work with multiple backend systems, and improves its utilization rate.
As shown in fig. 3, a flowchart of yet another embodiment of a voice processing method according to an embodiment of the present application may include the following steps:
301: and acquiring voice information sent by the electronic equipment.
In this embodiment, the voice information is collected and sent by the electronic device used by the user.
The voice processing method of this embodiment can be applied to a central server, which may be a high-performance computer, a cloud computing device, or another device with strong processing capability. The central server communicates with the electronic device, and also with a plurality of candidate servers so as to exchange information with them. The electronic device collects the user's voice information and, after acquiring the recommended content, outputs it to the user.
302: and determining keywords in the received and obtained voice information.
303: searching a target server corresponding to the keyword from a plurality of candidate servers.
304: and acquiring recommended content corresponding to the voice information fed back by the target server side, so as to output the recommended content for a user.
Acquiring the recommended content, corresponding to the voice information, fed back by the target server, so as to output it to the user, may include: acquiring the recommended content fed back by the target server; and sending the recommended content to the electronic device so that the electronic device outputs it to the user.
In this embodiment, the electronic device collects the voice information and sends it to the central server. The central server determines the keywords in the voice information and finds the target server corresponding to them among the plurality of candidate servers. Because the central server communicates with all the candidate servers and can call on each of them, once the target server has been found from the keywords, the request can be routed to exactly the right backend. A suitable target can thus be selected among different candidate servers, which expands the reach of the central server: through one electronic device the user can use the services provided by several candidate servers at once, broadening the device's application range and improving its utilization efficiency.
Generally, when voice information is processed it is first converted into text, and keywords are then obtained from the text through semantic recognition. As an embodiment, determining the keywords in the collected voice information may therefore include:
converting the voice information into text information;
and carrying out semantic recognition processing on the text information to obtain keywords in the text information.
Optionally, the converting the voice information into text information may include: converting the voice information into text information through a speech recognition algorithm. The speech recognition algorithm employed may include a deep neural network model, the Connectionist Temporal Classification (CTC) algorithm, and the like, and will not be described in detail herein.
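The two-stage keyword extraction described above (voice to text, then semantic filtering of the text) can be sketched as follows. This is a minimal illustration, not the patent's implementation: `recognize_speech` is a hypothetical stand-in for a real ASR model (for example, a deep neural network decoded with CTC), and the "semantic recognition" step is reduced to stop-word filtering.

```python
# Hypothetical stop-word list standing in for real semantic recognition.
STOP_WORDS = {"please", "the", "a", "to", "me", "for"}

def recognize_speech(voice_info: bytes) -> str:
    # Placeholder for a real speech-recognition model; here the "audio"
    # is simply UTF-8 text for illustration.
    return voice_info.decode("utf-8")

def extract_keywords(voice_info: bytes) -> list:
    text = recognize_speech(voice_info)                # voice -> text
    tokens = text.lower().replace(",", " ").split()    # naive tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # keep content words

keywords = extract_keywords(b"please play a song for me")
```

In a real system the filtering step would be replaced by a semantic-recognition model; the control flow (convert, then extract) is what the embodiment describes.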
In some of these embodiments, the searching for the target server corresponding to the keyword from the plurality of candidate servers includes:
determining system keywords of each of the plurality of candidate service ends;
searching target system keywords matched with the keywords from the system keywords of each of the plurality of candidate service ends;
And determining the candidate server corresponding to the target system keyword as the target server.
Each of the plurality of candidate servers may have a plurality of system keywords. In this embodiment of the application, in order to distinguish different candidate servers, the system keywords of a candidate server may be used to identify that candidate server, and the system keywords may thus be used to distinguish the corresponding candidate servers.
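Steps 201 to 204 above can be sketched as a lookup over per-server keyword sets. The server names and keyword sets below are hypothetical examples, not values from the patent:

```python
# Hypothetical registry: candidate server -> its system keywords.
SYSTEM_KEYWORDS = {
    "music-server": {"song", "play", "album"},
    "weather-server": {"weather", "forecast", "temperature"},
}

def find_target_server(keywords):
    """Return the candidate server whose system keywords match the
    keywords extracted from the voice information, if any."""
    for server, system_kw in SYSTEM_KEYWORDS.items():
        if system_kw & set(keywords):   # a target system keyword matched
            return server               # this server becomes the target
    return None                         # no candidate server matched
```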
As a possible implementation manner, the client names of the candidate clients may be used as system keywords, and the determining the system keywords of each of the plurality of candidate servers may include:
determining client names of the candidate clients corresponding to the candidate clients respectively;
and taking the client names corresponding to the candidate servers as the system keywords of the candidate servers, to obtain a plurality of system keywords.
In addition, in some embodiments, the server names of the candidate servers may be used as system keywords; developer information, system version information, and the like may also be used as system keywords. Any word capable of distinguishing the candidate servers may serve as a system keyword.
As yet another possible implementation manner, the system keywords of each of the plurality of candidate service ends may be determined by:
determining at least one candidate word;
sequentially determining candidate service ends corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the plurality of candidate service ends respectively;
and determining target candidate words corresponding to any candidate server as system keywords of the candidate server, and obtaining the system keywords of each candidate server.
A candidate word may refer to an individual word, a phrase, or a sentence model containing words and a sentence structure, and may be used to identify a corresponding candidate server. In some embodiments, the number of candidate words is very large, so the candidate words may be partitioned among the candidate servers so that different candidate servers can be distinguished by their candidate words. Each candidate server may correspond to a plurality of candidate words, and the plurality of candidate words corresponding to a candidate server form the system keywords of that server. A candidate server may thus have multiple system keywords, and the system keywords can be used to distinguish the different candidate servers.
In some embodiments, the sequentially determining candidate service ends corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the plurality of candidate service ends respectively includes:
And sequentially determining the candidate service ends respectively corresponding to the at least one candidate word according to the functional attributes respectively corresponding to the plurality of candidate service ends, and obtaining target candidate words respectively corresponding to the plurality of candidate service ends.
The functional attribute corresponding to a candidate server may include the network services that the candidate server provides for the user. For example, music software, that is, a client, may generally provide the user with functional services such as song playing, song library query, and song recommendation, and the functional attributes of the server corresponding to the music software, that is, the candidate server, may include audio search, audio recommendation, user feedback, and the like. The at least one candidate word can be assigned to different candidate servers according to its word meaning, so as to obtain the target candidate words respectively corresponding to the plurality of candidate servers.
Further, optionally, determining, in order, the candidate service ends corresponding to the at least one candidate word according to the functional attributes corresponding to the plurality of candidate service ends respectively, and obtaining the target candidate words corresponding to the plurality of candidate service ends respectively may include:
and determining target candidate words matched with the functional attributes from the at least one candidate word aiming at the functional attributes of any candidate server to obtain target candidate words respectively corresponding to the plurality of candidate servers.
Optionally, for the functional attribute of any candidate service end, a target candidate word with a word meaning matched with the functional attribute of the candidate service end may be determined from the at least one candidate word, so as to obtain target candidate words respectively corresponding to the plurality of candidate service ends.
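The distribution of candidate words to candidate servers by functional attribute can be sketched as below. The "word meaning matches the functional attribute" test is stubbed as a simple synonym table; a real system would use a semantic-similarity model. All server names, attributes, and the synonym table are hypothetical:

```python
# Hypothetical candidate servers and their functional attributes.
FUNCTIONAL_ATTRS = {
    "music-server": {"audio search", "audio recommendation"},
    "shop-server": {"order query", "product recommendation"},
}

# Hypothetical word-meaning table: candidate word -> related attribute.
SYNONYMS = {
    "song": "audio search",
    "playlist": "audio recommendation",
    "order": "order query",
}

def assign_candidate_words(candidate_words):
    """Partition candidate words among servers whose functional
    attributes match each word's meaning."""
    assignment = {server: [] for server in FUNCTIONAL_ATTRS}
    for word in candidate_words:
        attr = SYNONYMS.get(word)
        for server, attrs in FUNCTIONAL_ATTRS.items():
            if attr in attrs:              # word meaning matches attribute
                assignment[server].append(word)
    return assignment  # target candidate words per candidate server
```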
In one possible design, the at least one candidate word may be determined by:
acquiring a history keyword corresponding to the history voice information;
determining target historical keywords meeting selection conditions from the historical keywords;
and determining the target historical keyword as the at least one candidate word.
The candidate words may be obtained by selecting from the history keywords. Specifically, the history keywords corresponding to the history voice information may be obtained, and at least one candidate word satisfying the selection condition may be selected from the history keywords. Because a history keyword corresponds to a history server previously accessed by the user, the corresponding candidate server can be determined directly through the history keyword, which improves the efficiency of distributing candidate words.
Therefore, the sequentially determining the candidate servers respectively corresponding to the at least one candidate word, and obtaining the target candidate words respectively corresponding to the plurality of candidate servers, may include: sequentially determining the history servers respectively corresponding to the at least one candidate word, where the history servers are candidate servers, so that each candidate word is associated with its corresponding candidate server and the target candidate words respectively corresponding to the plurality of candidate servers are obtained.
After searching the target server corresponding to the keyword from the plurality of candidate servers, the method may further include: and associating the target server side with the keywords, and taking the keywords as a system keyword of the target server side.
In some embodiments, the determining the target history keyword satisfying the selection condition from the history keywords may include:
determining the occurrence number of each history keyword;
and determining the historical keywords with the occurrence times larger than the frequency threshold as target historical keywords.
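The frequency-based selection in steps 229 and 230 can be sketched directly; the threshold value here is an arbitrary example:

```python
from collections import Counter

def select_candidate_words(history_keywords, threshold=2):
    """Keep history keywords whose occurrence count exceeds the
    frequency threshold; these become the candidate words."""
    counts = Counter(history_keywords)
    return [kw for kw, n in counts.items() if n > threshold]
```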
In yet another possible design, the at least one candidate word may be determined by:
extracting voice attribute information in the voice information;
determining user attribute information of the user based on the voice attribute information;
and determining at least one candidate word with an association relation with the user attribute information.
Optionally, voice attribute information of the user may be extracted from the voice information, and the user's pronunciation characteristics and speaking content may be converted into data, so as to identify the user through the voice characteristics and to analyze the user's preferences and dialect. In this way, words related to the user can be determined by using the user's identity characteristics, and at least one candidate word can be obtained.
The voice attribute information of the user may include information such as the pitch, frequency, and timbre of the user's speech, the language used for speaking (that is, the dialect), and the grammar. The voice attribute information can be used to determine user attribute information such as the user's identity, age, region, content of interest, and historically accessed servers, so that at least one candidate word associated with the user attribute information can be determined.
Alternatively, at least one candidate word having an association relationship with the attribute information may be searched for from the word library based on the attribute information of the user.
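The word-library lookup by user attribute can be sketched as a simple association table. The attribute labels and word library below are invented for illustration only:

```python
# Hypothetical word library: user attribute -> associated candidate words.
WORD_LIBRARY = {
    "cantonese": ["dim sum", "cantopop"],
    "teenager": ["pop chart", "game"],
}

def candidate_words_for_user(user_attrs):
    """Collect candidate words associated with each of the user's
    attribute labels, in order."""
    words = []
    for attr in user_attrs:
        words.extend(WORD_LIBRARY.get(attr, []))
    return words
```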
As yet another possible implementation manner, the system keywords of each of the plurality of candidate service ends may be determined by:
and detecting system keywords set by the user aiming at any candidate server to obtain the system keywords of each of the plurality of candidate servers.
The user may set keywords for the candidate servers or the candidate clients. The system keywords of a server are the same as those of its corresponding client.
Because of regional differences, the intonation, language family, grammar, and the like used by users in different regions may differ. As a further embodiment, the method may further comprise:
And determining voice attribute information of the voice information.
The obtaining the recommended content corresponding to the voice information fed back by the target server, so as to output the recommended content for the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content based on the voice attribute information.
Alternatively, the voice attribute information of the voice information may include information such as the language, dialect, language family, grammar, and/or intonation of the voice information. The language may refer to the language type used when the user utters the voice, for example, Chinese, English, or Spanish. The dialect may refer to a regional variant of a language used by the user, for example, Cantonese or Mandarin Chinese. The language family refers to a grouping used in historical comparative linguistics that divides languages by their genetic relationships. Grammar refers to the parts of speech, the inflections of words, and other means of expressing the relationships among words, as well as the functions and relationships of words in sentences, applied according to established usage. Intonation may refer to information such as rising, falling, and level tones in the voice information, through which the emotion of the user can be judged; for example, the user may utter the voice under a particular emotional state.
In some embodiments, the outputting the recommended content based on the voice attribute information may include:
converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
Optionally, the converting the recommended content into the first recommended voice corresponding to the voice attribute information may include: and generating a first recommended voice corresponding to the recommended content by taking the voice attribute information as a voice generation parameter.
When the recommended content is converted into the first recommended voice corresponding to the target language, the recommended content may be converted into the first recommended voice according to information such as the language, dialect, language family, grammar, and/or intonation in the voice attribute information. For example, when the voice attribute information of the voice information indicates Chinese and Cantonese, the recommended content may be converted into a first recommended voice composed of Cantonese grammar and words based on Chinese. For another example, if the intonation in the voice attribute information contains many rising tones, the emotion of the user may be unstable, and therefore the first recommended voice may be set to a level or falling tone.
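The parameter-selection logic described here — taking the user's voice attributes as speech-generation parameters — can be sketched as follows. The TTS engine itself is not shown; the field names, defaults, and the rising-tone threshold are assumptions made for illustration:

```python
def make_tts_params(voice_attrs):
    """Derive speech-generation parameters from extracted voice
    attribute information (hypothetical schema)."""
    params = {
        "language": voice_attrs.get("language", "zh"),
        "dialect": voice_attrs.get("dialect", "mandarin"),
    }
    # If the user's intonation contains many rising tones (possible
    # agitation), answer in a level tone, as the embodiment suggests.
    if voice_attrs.get("rising_tones", 0) > 3:
        params["tone"] = "flat"
    else:
        params["tone"] = "neutral"
    return params
```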
In one possible design, the converting the recommended content into the first recommended speech corresponding to the target language includes:
and if the user is located in the self-service place, converting the recommended content into first recommended voice corresponding to the target language.
In a self-service place, the electronic device may be a self-service terminal, and the user can interact with the self-service terminal by voice to complete self-service processing of related products and services in the self-service place.
In certain embodiments, the method may further comprise:
determining user attribute information of the user based on the voice attribute information of the voice information;
the plurality of candidate service ends are determined by:
and determining a plurality of candidate service ends with association relation with the user attribute information.
Besides determining at least one candidate word associated with the user attribute information, a plurality of candidate servers associated with the user can also be determined through the user attribute information. In this way, candidate servers with a higher degree of association with the user can be obtained, targeted server screening is realized, and the efficiency of server selection is improved.
In the embodiment of the application, the recommended content is converted into the voice to be output for the user, so that the user can quickly acquire the recommended content, the user does not need to execute other operations again, the noninductive interaction is realized, and the application efficiency of the electronic equipment is improved.
In order to improve the experience of voice interaction and improve the application range of the electronic device, as a further embodiment, the method may further include:
determining sex attribute information of the voice information;
the obtaining the recommended content corresponding to the voice information fed back by the target server, so as to output the recommended content for the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content to the user based on the gender attribute information.
The sex attribute information included in the voice information may specifically refer to a sex attribute of a user who uttered a voice, and for example, when a sound uttered by the user is determined to be female sound, the sex attribute of the voice may be determined to be female. When the name of a male is contained in the voice information of the user, the sex attribute of the voice can be determined to be male.
As a possible implementation manner, the determining the gender attribute information of the voice information may include:
Sex attribute information of the voice information is determined based on the characteristic information of the voice information.
As yet another possible implementation manner, the determining the gender attribute information of the voice information includes:
converting the voice information into text information;
carrying out semantic recognition processing on the text information to obtain name keywords in the text information;
and determining the sex attribute information of the voice information by using the name keyword.
In some embodiments, the determining the gender attribute information of the voice information using the name keyword includes:
if the name keyword is matched with the first type of names, determining that the gender attribute information of the voice information is first attribute information;
and if the name keyword is matched with the second class of names, determining the gender attribute information of the voice information as second attribute information.
The first type of names may be a set of female names, and the second type of names may be a set of male names. A name keyword matching the first type of names may mean that the first type of names contains a name whose similarity to the name keyword exceeds a similarity threshold; likewise, a name keyword matching the second type of names may mean that the second type of names contains a name whose similarity to the name keyword exceeds the similarity threshold. The similarity threshold may be set according to the similarity requirement for two names: the higher the similarity, the more alike the two names are. The first type of names and the second type of names may be name lexicons obtained based on statistics or the like. By classifying names with different attributes to control voice output, personalized output of the recommended content can be realized, multi-level application of the electronic device is provided, and the application efficiency of the electronic device is improved.
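The similarity-threshold matching against the two name lexicons can be sketched as below. `difflib.SequenceMatcher` is used as a stand-in similarity measure, and both lexicons and the threshold value are hypothetical:

```python
from difflib import SequenceMatcher

FEMALE_NAMES = {"alice", "mary"}   # first type of names (example lexicon)
MALE_NAMES = {"bob", "david"}      # second type of names (example lexicon)

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def gender_from_name(name_keyword, threshold=0.8):
    """Return the gender attribute implied by a name keyword, or None
    if neither lexicon contains a sufficiently similar name."""
    name = name_keyword.lower()
    if any(similarity(name, n) > threshold for n in FEMALE_NAMES):
        return "female"            # first attribute information
    if any(similarity(name, n) > threshold for n in MALE_NAMES):
        return "male"              # second attribute information
    return None
```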
In one possible design, the outputting the recommended content for the user based on the gender attribute information includes:
the gender attribute information is used as a sound generation parameter to generate a second recommended voice corresponding to the recommended content;
and outputting the second recommended voice.
When the gender attribute information is used as a sound generation parameter to generate the second recommended voice corresponding to the recommended content, the output timbre of the second recommended voice may be controlled to match the gender attribute information. For example, when the gender attribute information is male attribute information, the gender attribute of the second recommended voice when output is also male.
As shown in fig. 4, a flowchart of another embodiment of a speech processing method according to an embodiment of the present application may include the following steps:
401: and collecting voice information of the user.
Some steps of the embodiments of the present application are the same as those of the foregoing embodiments, and are not repeated here.
402: and identifying keywords in the voice information.
403: searching a target server corresponding to the keyword from a plurality of candidate servers.
404: and sending the voice information to the target server side so that the target server side can determine recommended content corresponding to the voice information and feed back the recommended content.
405: and acquiring the recommended content fed back by the target server.
406: and determining voice attribute information and gender attribute information of the voice information.
407: and generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as voice generation parameters.
408: and outputting the third recommended voice.
In the embodiment of the application, the electronic device can collect the voice information of the user and can identify the keywords in the voice information, so that the target server can be queried from a plurality of candidate servers by utilizing the keywords to send the voice information to the target server. The target server side can obtain the recommended content corresponding to the voice information, the recommended content is fed back to the electronic equipment, and the electronic equipment can obtain the recommended content fed back by the target server side. The electronic equipment can establish a connection relationship with a plurality of candidate service ends, so that the system application of the plurality of candidate service ends of the electronic equipment can be realized, and the application efficiency of the electronic equipment is improved. In addition, the voice attribute information and the sex attribute information of the voice information can be determined, the voice attribute information and the sex attribute information can be used as sound generation parameters, the third recommended voice corresponding to the recommended content is generated and output, personalized output of the recommended content is realized, and the application range of the electronic equipment is improved.
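The end-to-end flow of steps 401 to 408 can be wired together as below. Every component — the recognizer, the server registry, the recommendation source, and the attribute detector — is a hypothetical stub; only the control flow follows the embodiment:

```python
# Hypothetical registries for the sketch.
SYSTEM_KEYWORDS = {"music-server": {"song", "play"}}
RECOMMENDATIONS = {"music-server": "Here is a song you may like"}

def process_voice(voice_text):
    keywords = set(voice_text.lower().split())            # 401-402: collect + identify
    target = next((s for s, kw in SYSTEM_KEYWORDS.items() # 403: find target server
                   if kw & keywords), None)
    if target is None:
        return None
    content = RECOMMENDATIONS[target]                     # 404-405: send + get content
    attrs = {"gender": "female", "dialect": "mandarin"}   # 406: attribute detection (stubbed)
    return {"text": content, "voice_params": attrs}       # 407-408: generate + output

result = process_voice("play a song")
```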
As shown in fig. 5, a flowchart of another embodiment of a voice processing method provided in an embodiment of the present application is applied to an electronic device, where the method may include the following steps:
501: collecting voice information of a user;
502: and sending the voice information to a center server.
The key words in the voice information are determined by the central server, and the key words are used for searching target servers corresponding to the key words from a plurality of candidate servers; the target server side is used for feeding back recommended content corresponding to the voice information to the center server side; and the recommended content is fed back to the electronic equipment by the central server.
503: and acquiring the recommended content fed back by the center server side so as to output the recommended content for the user.
The embodiment of the application can be applied to an electronic device. The electronic device may include an intelligent sound box, a smart phone, and the like, and can provide a voice interaction function for the user.
The central server in the embodiment of the application may execute the voice processing method shown in fig. 3: it acquires the voice information sent by the electronic device, determines keywords in the voice information, searches for the target server corresponding to the keywords among the plurality of candidate servers, and acquires the recommended content corresponding to the voice information from the target server. By collecting the voice information of the user and exchanging data with the central server, the electronic device can obtain the network services provided by the plurality of candidate servers, and the user perceives nothing of the processing and switching of those services. This expands the application range of the electronic device without affecting the user's use and improves the utilization rate of the electronic device.
For easy understanding, the technical scheme of the embodiment of the application is described in detail by taking the electronic device as an intelligent sound box as an example. As shown in fig. 6, the smart speaker M1 may establish wireless communication with a plurality of candidate service terminals M2, where a communication relationship between the two is represented by a wireless transmission symbol.
The candidate server may be a server composed of a computer or a cloud server.
The intelligent sound box M1 may collect the voice information 601 of the user U and identify the keyword 602 in the voice information, and then, the intelligent sound box M1 may search the target server 603 corresponding to the keyword from the plurality of candidate servers M2. Assuming that the found target server is M2S in M2, the smart speaker M1 may send 604 the voice information to the target server M2S. The wireless transmission symbol is not shown between the intelligent sound box M1 and the target server M2S.
Then, the target server M2S may obtain the recommended content corresponding to the voice information 605 and feed back the recommended content.
Then, the smart speaker M1 may acquire the recommended content 606 fed back by the target server M2S, so that the smart speaker M1 outputs the recommended content 607.
The intelligent sound box M1 is connected with the plurality of candidate servers M2, and the collected voice information is analyzed to determine which system it belongs to, so as to access the corresponding target server. In this way, access switching among different candidate servers is realized, the application range of the intelligent sound box is expanded, and the utilization efficiency of the intelligent sound box is improved.
As shown in fig. 7, a flowchart of another embodiment of a voice processing method according to an embodiment of the present application may include:
701: and determining keywords in the acquired voice information.
Some steps of the embodiments of the present application are the same as those of the previous embodiments, and are not repeated here.
Optionally, the technical solution of the embodiment of the present application may be applied to an electronic device, where the electronic device may integrate a plurality of candidate clients, where the candidate clients may include a voice client, and the electronic device may simultaneously correspond to the plurality of candidate clients, and may obtain customer services provided by the plurality of candidate clients. For example, a plurality of music software may be installed in the electronic device to simultaneously use network services provided by the plurality of music software.
702: and searching target clients corresponding to the keywords from the plurality of candidate clients.
703: and acquiring recommended content determined by the target client based on the voice information, so as to output the recommended content for a user.
The acquiring the recommended content determined by the target client based on the voice information may include: and sending the voice information to a corresponding target server through the target client so that the target server can determine recommended content corresponding to the voice information and feed back the recommended content to the target client, so that the electronic equipment can acquire the recommended content through the target client.
In this embodiment, by integrating a plurality of candidate clients, a keyword in the voice information can be identified after the voice information of the user is collected, so that a target client corresponding to the keyword can be searched from the plurality of candidate clients, and the recommended content determined by the voice information can be obtained through the target client, so that the recommended content can be output for the user. The electronic equipment integrates a plurality of candidate clients to realize comprehensive application of the plurality of candidate clients so as to obtain voice services respectively provided by the plurality of candidate clients. Each candidate client can communicate with the corresponding candidate server, and the electronic equipment obtains background services of the multiple candidate servers by utilizing the multiple candidate clients, so that the application efficiency of the electronic equipment is improved.
Generally, when processing voice information, the voice information may first be converted into text information, and semantic recognition may then be performed on the converted text to obtain keywords in the voice information. As one embodiment, the determining the keywords in the collected voice information includes:
converting the voice information into text information;
and carrying out semantic recognition processing on the text information to obtain keywords in the text information.
Optionally, the converting the voice information into text information may include: converting the voice information into text information through a speech recognition algorithm. The speech recognition algorithm employed may include a deep neural network model, the Connectionist Temporal Classification (CTC) algorithm, and the like, and will not be described in detail herein.
In some embodiments, the searching for the target client corresponding to the keyword from the plurality of candidate clients may include:
determining system keywords of each of the plurality of candidate clients;
searching target system keywords matched with the keywords from the system keywords of each of the candidate clients;
And determining the candidate client corresponding to the target system keyword as the target client.
Each of the plurality of candidate clients may have a plurality of system keywords. In this embodiment, in order to distinguish different candidate clients, the system keywords of different candidate clients may be different, and each system keyword may be used to distinguish the corresponding candidate client.
As one possible implementation manner, the client names of the candidate clients may be used as system keywords, and determining the system keywords of each of the plurality of candidate clients may include:
and determining the client names of the plurality of candidate clients as system keywords of the plurality of candidate clients.
In addition, in some embodiments, besides the client names of the candidate clients, developer information, system version information, and the like may also be used as system keywords. Any word that can distinguish the candidate clients may serve as a system keyword.
As yet another possible implementation manner, the system keywords of each of the plurality of candidate clients may be determined by:
Determining at least one candidate word;
sequentially determining candidate clients corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the plurality of candidate clients respectively;
and determining the target candidate word corresponding to any candidate client as a system keyword of that candidate client, so as to obtain the system keywords of each of the plurality of candidate clients.
The candidate clients in the embodiment of the application correspond to the candidate servers in the foregoing embodiment: any candidate client may have a corresponding candidate server, and the candidate client and the candidate server together form a system that provides interactive services for the user. The candidate server provides services such as data interaction, data support, and data query for the corresponding candidate client. The system keywords of a candidate client may be the same as the system keywords of its corresponding candidate server; that is, within one system providing interactive services for the user, the server and the client may use the same system keywords.
It should be noted that, in the embodiment of the application, the process and steps of determining the at least one candidate word for a candidate client and matching it with the corresponding system keywords are the same as those for a candidate server in the foregoing embodiment, and are not repeated herein.
In some embodiments, the sequentially determining the candidate clients corresponding to the at least one candidate word respectively, and obtaining the target candidate words corresponding to the plurality of candidate clients respectively may include:
and sequentially determining the candidate clients corresponding to the at least one candidate word according to the functional attributes corresponding to the candidate clients respectively to obtain target candidate words corresponding to the candidate clients respectively.
Further, optionally, determining, in order, the candidate clients corresponding to the at least one candidate word according to the functional attributes corresponding to the plurality of candidate clients, and obtaining the target candidate word corresponding to the plurality of candidate clients may include:
and determining target candidate words matched with the functional attributes from the at least one candidate word aiming at the functional attributes of any candidate client so as to obtain target candidate words respectively corresponding to the plurality of candidate clients.
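The attribute-based matching described above can be sketched as follows. This is an illustrative sketch only: the attribute sets, client names, and the membership-based matching rule are assumptions standing in for whatever matching logic an actual implementation would use.

```python
# Hypothetical sketch: assign each candidate word to the candidate client
# whose declared functional attributes it matches, yielding the target
# candidate words (system keywords) per client.
def assign_target_words(candidate_words, client_attributes):
    """Return {client: [target candidate words]} by attribute matching."""
    targets = {client: [] for client in client_attributes}
    for word in candidate_words:
        for client, attrs in client_attributes.items():
            if word in attrs:  # word matches one of the client's functional attributes
                targets[client].append(word)
    return targets

client_attributes = {
    "music_app":    {"play", "song", "playlist"},
    "shopping_app": {"order", "cart", "checkout"},
}
system_keywords = assign_target_words(["play", "cart"], client_attributes)
```

Each matched target candidate word then serves as a system keyword of that candidate client, as stated above.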
As yet another possible implementation manner, the system keywords of each of the plurality of candidate clients are determined by:
and detecting system keywords set by the user for any candidate client so as to obtain the system keywords of each of the plurality of candidate clients.
In some embodiments, after determining keywords in the collected voice information, the method may further include:
determining voice attribute information of the voice information;
the obtaining the recommended content determined by the target client based on the voice information, so as to output the recommended content for the user comprises the following steps:
acquiring recommended content determined by the target client based on the voice information;
and outputting the recommended content based on the voice attribute information.
As an embodiment, the method may further include:
determining user attribute information of the user based on the voice attribute information of the voice information;
the plurality of candidate clients may be determined by:
and determining a plurality of candidate clients which have association relations with the user attribute information.
It should be noted that, part of the steps in the embodiments of the present application are the same as those in the foregoing embodiments, and are not repeated here.
Because of regional differences, the intonation, language family, grammar, and the like used by users in different regions may differ. As a further embodiment, the method may further comprise:
and determining voice attribute information of the voice information.
The outputting the recommended content for the user includes:
and outputting the recommended content based on the voice attribute information.
Alternatively, the voice attribute information of the voice information may include information such as the language, dialect, language family, grammar, and/or intonation used in the voice information. The language may refer to the language type used when the user utters the voice, for example, Chinese, English, or Spanish. The dialect may refer to a regional variety of a language, for example, Cantonese or Mandarin. The language family refers to a grouping of languages in historical comparative linguistics according to their genetic relationships. Grammar refers to the parts of speech, the inflection of words, or other means of expressing interrelationships, as well as the functions and relationships of words in sentences, applied according to established usage. Intonation may refer to information such as rising, falling, and level tones in the voice information, from which the user's emotion can be judged; for example, the user may utter the voice under a particular emotional state.
In some embodiments, the outputting the recommended content based on the voice attribute information may include:
Converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
Optionally, the converting the recommended content into the first recommended voice corresponding to the voice attribute information may include: and generating a first recommended voice corresponding to the recommended content by taking the voice attribute information as a voice generation parameter.
When converting the recommended content into the first recommended voice corresponding to the voice attribute information, the recommended content may be converted according to information such as the language, dialect, language family, grammar, and/or intonation in the voice attribute information. For example, when the voice attribute information of the voice information indicates Chinese and Cantonese, the recommended content may be converted into a first recommended voice composed of Cantonese grammar and words based on Chinese. For another example, if the intonation in the voice attribute information contains many rising tones, the user's emotion may not be stable enough, and therefore the first recommended voice may be set to a flat or falling tone.
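The rule above — carrying the detected voice attributes into the generation parameters and softening the output tone when the user sounds agitated — can be sketched as follows. The attribute field names and the rising-tone threshold are illustrative assumptions, not a real speech-synthesis API.

```python
# Minimal sketch, assuming voice attribute information is a flat dict of
# fields such as language, dialect, and a count of rising tones detected
# in the user's intonation.
def build_generation_params(voice_attributes):
    """Derive sound generation parameters for the first recommended voice."""
    params = dict(voice_attributes)
    # If the intonation contains many rising tones, the user's emotion may
    # not be stable, so soften the output to a flat tone (as described above).
    if params.get("rising_tone_count", 0) > 3:
        params["output_tone"] = "flat"
    else:
        params["output_tone"] = params.get("intonation", "neutral")
    return params

params = build_generation_params(
    {"language": "Chinese", "dialect": "Cantonese", "rising_tone_count": 5}
)
```

A real system would pass these parameters to its speech synthesizer; the point here is only that the output voice is conditioned on the input voice attributes.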
In one possible design, the converting the recommended content into the first recommended voice corresponding to the voice attribute information includes:
and if the user is located in a self-service place, converting the recommended content into the first recommended voice corresponding to the voice attribute information.
In a self-service place, the electronic device may be a self-service terminal, and the user can interact with the self-service terminal by voice to complete self-service handling of the related products and services offered in that place.
In the embodiments of the present application, the recommended content is converted into voice and output for the user, so that the user can quickly obtain the recommended content without performing any further operation, realizing seamless interaction and improving the application efficiency of the electronic device.
In order to improve the experience of voice interaction and improve the application range of the electronic device, as a further embodiment, the method may further include:
determining sex attribute information of the voice information;
the obtaining the recommended content determined by the target client based on the voice information to output the recommended content for the user may include:
and acquiring recommended content determined by the target client based on the voice information.
And outputting the recommended content to the user based on the gender attribute information.
The sex attribute information of the voice information may specifically refer to the sex of the user who uttered the voice. For example, when the sound uttered by the user is determined to be a female voice, the sex attribute of the voice may be determined to be female; when a male name is contained in the user's voice information, the sex attribute of the voice may be determined to be male.
As a possible implementation manner, the determining the gender attribute information of the voice information may include:
sex attribute information of the voice information is determined based on the characteristic information of the voice information.
As yet another possible implementation manner, the determining the gender attribute information of the voice information includes:
converting the voice information into text information;
carrying out semantic recognition processing on the text information to obtain name keywords in the text information;
and determining the sex attribute information of the voice information by using the name keyword.
In some embodiments, the determining the gender attribute information of the voice information using the name keyword includes:
if the name keyword is matched with the first type of names, determining that the gender attribute information of the voice information is first attribute information;
And if the name keyword is matched with the second class of names, determining the gender attribute information of the voice information as second attribute information.
The first type of names may be a set of female names and the second type of names may be a set of male names. By classifying names with different attributes to perform voice output control, personalized output of recommended content can be realized, multi-level application of the electronic equipment is provided, and application efficiency of the electronic equipment is improved.
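The name-matching rule above reduces to a lookup against two name sets. The sketch below uses tiny placeholder sets; a real system would match against large name dictionaries, and the attribute labels are illustrative.

```python
# Hypothetical sketch of determining gender attribute information from a
# name keyword extracted from the voice information.
FIRST_CLASS_NAMES = {"Mary", "Alice"}    # placeholder set of female names
SECOND_CLASS_NAMES = {"John", "Robert"}  # placeholder set of male names

def gender_attribute(name_keyword):
    """Map a name keyword to first/second attribute information, or None."""
    if name_keyword in FIRST_CLASS_NAMES:
        return "first_attribute"   # female
    if name_keyword in SECOND_CLASS_NAMES:
        return "second_attribute"  # male
    return None  # no match: gender attribute cannot be determined from the name
```

When neither set matches, the system could fall back to the acoustic-feature-based determination described earlier.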
In one possible design, the outputting the recommended content for the user based on the gender attribute information includes:
the gender attribute information is used as a sound generation parameter to generate a second recommended voice corresponding to the recommended content;
and outputting the second recommended voice.
When the gender attribute information is used as a sound generation parameter to generate the second recommended voice corresponding to the recommended content, the output tone of the second recommended voice may be controlled to match the gender attribute information. For example, when the gender attribute information is male attribute information, the gender attribute of the second recommended voice at output is also male attribute information.
It should be noted that, part of the steps in the embodiments of the present application are described in detail in the foregoing embodiments, and are not repeated here.
As shown in fig. 8, a flowchart of yet another embodiment of a voice processing method according to an embodiment of the present application may include the following steps:
801: and collecting voice information of the user.
Some steps of the embodiments of the present application are the same as those of the previous embodiments, and are not repeated here.
802: and identifying keywords in the voice information.
803: and searching target clients corresponding to the keywords from a plurality of candidate clients.
804: and acquiring recommended content determined by the target client based on the voice information.
805: and determining voice attribute information and gender attribute information of the voice information.
806: and generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as voice generation parameters.
807: and outputting the third recommended voice.
In the embodiment of the application, the electronic device can collect the voice information of the user and can identify the keywords in the voice information, so that the target client can be queried from a plurality of candidate clients by utilizing the keywords, and the recommended content corresponding to the voice information is obtained through the target client. Each candidate client can communicate with the corresponding candidate server, and the electronic equipment obtains background services of multiple candidate servers by utilizing the multiple candidate clients, so that the application efficiency of the electronic equipment is improved. In addition, the voice attribute information and the sex attribute information of the voice information can be determined, the voice attribute information and the sex attribute information can be used as sound generation parameters, the third recommended voice corresponding to the recommended content is generated and output, personalized output of the recommended content is realized, and the application range of the electronic equipment is improved.
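Steps 801 to 807 above can be sketched end to end as follows. Every function body here is a stub assumption standing in for the real recognition, recommendation, and synthesis modules; the data layout of the candidate clients is likewise assumed.

```python
# Illustrative sketch of the fig. 8 flow: collect -> recognize keywords ->
# find target client -> get recommended content -> derive attributes ->
# generate and output the third recommended voice.
def process_voice(audio, clients):
    text = recognize(audio)                      # steps 801-802
    keyword = extract_keyword(text)
    target = next(c for c in clients             # step 803: match system keywords
                  if keyword in c["system_keywords"])
    content = target["recommend"](text)          # step 804
    attrs = {"voice": voice_attrs(audio),        # step 805
             "gender": gender_attrs(audio)}
    return synthesize(content, attrs)            # steps 806-807

# Stub components (placeholders for ASR, attribute extraction, and TTS).
def recognize(audio): return audio
def extract_keyword(text): return text.split()[0]
def voice_attrs(audio): return {"language": "Chinese"}
def gender_attrs(audio): return "female"
def synthesize(content, attrs): return (content, attrs)

clients = [{"system_keywords": {"music"}, "recommend": lambda t: "top songs"}]
out = process_voice("music please", clients)
```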
As shown in fig. 9, a schematic structural diagram of an embodiment of a speech processing device according to an embodiment of the present application may include:
a first determining module 901, configured to determine keywords in the collected voice information;
a first search module 902, configured to search, from a plurality of candidate servers, for a target server corresponding to the keywords;
a first processing module 903, configured to acquire recommended content corresponding to the voice information fed back by the target server, so as to output the recommended content for the user.
In the embodiments of the present application, after the voice information is collected, the keywords in the voice information are identified. The keywords can be used to search for the target server from among the plurality of candidate servers, realizing selection among multiple candidate servers; the recommended content provided by the target server is obtained directly, and the switching process requires no user operation, thereby improving application efficiency. The electronic device can correspond to a plurality of candidate servers at the same time, realizing comprehensive application of multiple voice systems and improving the application efficiency of the electronic device.
In some embodiments, the second determining module may include:
the text conversion unit is used for converting the voice information into text information;
The word processing unit is used for carrying out semantic recognition processing on the text information to obtain keywords in the text information.
As an embodiment, the first search module may include:
a first determining unit, configured to determine system keywords of each of the plurality of candidate service ends;
and the word matching unit is used for searching target system keywords matched with the keywords from the system keywords of each of the plurality of candidate service ends.
The target determining unit is used for determining the candidate server corresponding to the target system keyword as the target server.
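The lookup performed by the units above reduces to matching the recognized keyword against each candidate server's system keywords. The data layout below is an assumption for illustration.

```python
# Minimal sketch of the first search module: find the candidate server whose
# system keywords contain a target system keyword matching the keyword.
def find_target_server(keyword, candidate_servers):
    """candidate_servers: {server_name: set of system keywords}."""
    for server, system_keywords in candidate_servers.items():
        if keyword in system_keywords:
            return server  # the candidate server of the matched target keyword
    return None

servers = {"weather_srv": {"weather", "forecast"},
           "music_srv":   {"song", "playlist"}}
target = find_target_server("forecast", servers)
```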
In some embodiments, the first determining unit may include:
a first determining subunit, configured to determine client names of respective candidate clients corresponding to the plurality of candidate servers;
the first obtaining subunit is configured to use the client names corresponding to the plurality of candidate service ends as system keywords of the plurality of candidate service ends.
In some embodiments, the first determining unit may include:
a second determining subunit configured to determine at least one candidate word;
the second obtaining subunit is used for sequentially determining candidate service ends corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the plurality of candidate service ends respectively;
And the third obtaining subunit is used for determining that the target candidate word corresponding to any candidate server is the system keyword of the candidate server and obtaining the system keywords of each candidate server.
As one possible implementation manner, the second obtaining subunit may include:
the word matching module is used for sequentially determining the candidate service ends respectively corresponding to the at least one candidate word according to the functional attributes respectively corresponding to the plurality of candidate service ends, and obtaining target candidate words respectively corresponding to the plurality of candidate service ends.
Further optionally, the word matching module may include:
the word matching unit is used for determining target candidate words matched with the functional attributes from the at least one candidate word aiming at the functional attributes of any candidate server to obtain target candidate words respectively corresponding to the candidate servers.
As yet another possible implementation manner, the second determining subunit may include:
the first acquisition module is used for acquiring historical keywords corresponding to the historical voice information;
the first selection module is used for determining target historical keywords meeting selection conditions from the historical keywords;
And the first target module is used for determining the target historical keywords as the at least one candidate word.
In some embodiments, the first selection module may include:
a first statistical unit for determining the occurrence number of each history keyword;
and the word selecting unit is used for determining the historical keywords with the occurrence times larger than the frequency threshold as target historical keywords.
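The selection condition described by these units — keep history keywords whose occurrence count exceeds a frequency threshold — can be sketched directly. The threshold value is illustrative.

```python
from collections import Counter

# Sketch of candidate-word selection from history keywords: count occurrences
# of each history keyword and keep those above the frequency threshold.
def select_candidate_words(history_keywords, threshold=2):
    counts = Counter(history_keywords)
    return {kw for kw, n in counts.items() if n > threshold}

candidates = select_candidate_words(
    ["music", "music", "music", "news", "news", "weather"], threshold=2
)
```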
As yet another possible implementation manner, the second determining subunit may include:
and the first extraction module is used for extracting voice attribute information in the voice information.
And a third determining module, configured to determine user attribute information of the user based on the voice attribute information.
And the candidate determining module is used for determining at least one candidate word with an association relation with the user attribute information.
In some embodiments, the first determining unit may include:
and the first detection subunit is used for detecting the system keywords set by the user for any candidate server to obtain the system keywords of each of the candidate servers.
As an embodiment, the apparatus may further include:
a fourth determining module, configured to determine voice attribute information of the voice information;
The first processing module includes:
the content acquisition unit is used for acquiring recommended content corresponding to the voice information fed back by the target server;
and the first output unit is used for outputting the recommended content based on the voice attribute information.
As a possible implementation manner, the first output unit may include:
a first conversion subunit, configured to convert the recommended content into a first recommended voice corresponding to the voice attribute information;
and the first output subunit is used for outputting the first recommended voice.
Further optionally, the first conversion subunit may specifically be configured to:
and if the user is located in the self-service place, converting the recommended content into first recommended voice corresponding to the voice attribute information.
In some embodiments, the apparatus may further comprise:
and a fifth determining module, configured to determine user attribute information of the user based on the voice attribute information of the voice information.
And a sixth determining module, configured to determine a plurality of candidate service ends that have an association relationship with the user attribute information.
As an embodiment, the apparatus may further include:
A gender determination module for determining gender attribute information of the voice information;
the first processing module includes:
a third content acquisition unit, configured to acquire the recommended content corresponding to the voice information fed back by the target server;
and the second output unit is used for outputting the recommended content for the user based on the gender attribute information.
As one possible implementation, the gender determination module may include:
and a second determining unit configured to determine sex attribute information of the voice information based on feature information of the voice information.
As yet another possible implementation manner, the gender determining module may include:
the text conversion unit is used for converting the voice information into text information;
a name extraction unit, configured to perform semantic recognition processing on the text information to obtain the name keywords in the text information;
and a third determining unit for determining sex attribute information of the voice information by using the name keyword.
Further, optionally, the third determining unit may include:
a first matching subunit, configured to determine, if the name keyword matches a first type name, that sex attribute information of the voice information is first attribute information;
And the second matching subunit is used for determining that the sex attribute information of the voice information is second attribute information if the name keyword is matched with the second class of names.
As a further possible implementation manner, the second output unit includes:
a first generation subunit, configured to generate a second recommended voice corresponding to the recommended content, with the gender attribute information as a sound generation parameter;
and the second output subunit is used for outputting the second recommended voice.
As an embodiment, the first determining module may include:
the first acquisition unit is used for acquiring the voice information of the user;
and the fourth determining unit is used for determining keywords in the acquired voice information.
As yet another embodiment, the first determining module may include:
the first acquisition unit is used for acquiring voice information sent by the electronic equipment; the voice information is acquired by the electronic equipment;
and a fifth determining unit for determining keywords in the received and obtained voice information.
In some possible designs, the apparatus may further comprise:
the first sending module is used for sending the voice information to the target server; the voice information is used for determining recommended content corresponding to the target server side by the target server side.
The second output unit may include:
a second generation subunit, configured to generate a third recommended voice corresponding to the recommended content by using the voice attribute information and the gender attribute information as sound generation parameters;
a voice output subunit, configured to output the third recommended voice.
The speech processing apparatus shown in fig. 9 may perform the speech processing method described in the embodiments shown in fig. 1 to 4; its implementation principle and technical effects are not repeated here. The specific manner in which the individual modules, units, and subunits of the speech processing apparatus in the above embodiments perform their operations has been described in detail in the embodiments of the method, and will not be described in detail here.
The speech processing apparatus shown in fig. 9 may be configured as a speech processing device. As shown in fig. 10, which is a schematic structural diagram of an embodiment of a speech processing device according to an embodiment of the present application, the device includes: a storage component 1001 and a processing component 1002; the storage component 1001 stores one or more computer instructions, which are invoked by the processing component 1002;
the processing component 1002 is configured to:
determining keywords in the acquired voice information; searching a target server corresponding to the keyword from a plurality of candidate servers; and acquiring recommended content corresponding to the voice information fed back by the target server side, so as to output the recommended content for a user.
In the embodiments of the present application, after the voice information is collected, the keywords in the voice information are identified. The keywords can be used to search for the target server from among the plurality of candidate servers, realizing selection among multiple candidate servers; the recommended content provided by the target server is obtained directly, and the switching process requires no user operation, thereby improving application efficiency. The electronic device can correspond to a plurality of candidate servers at the same time, realizing comprehensive application of multiple voice systems and improving the application efficiency of the electronic device.
As an embodiment, the determining, by the processing component, of the keywords in the collected voice information may specifically be: converting the voice information into text information; and performing semantic recognition processing on the text information to obtain the keywords in the text information.
As a possible implementation manner, the searching, by the processing component, for a target server corresponding to the keyword from a plurality of candidate servers may specifically be:
determining system keywords of each of the plurality of candidate service ends;
searching target system keywords corresponding to the keywords from the system keywords of each of the plurality of candidate service ends;
And determining the candidate server corresponding to the target system keyword as the target server.
In some embodiments, the determining, by the processing component, the system keywords of each of the plurality of candidate service ends may specifically be:
determining client names of the candidate clients corresponding to the candidate clients respectively;
and taking the client names corresponding to the candidate service ends as the system keywords of the candidate service ends.
As one embodiment, the processing component may determine the system keywords of each of the plurality of candidate servers by:
determining at least one candidate word;
sequentially determining candidate service ends corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the plurality of candidate service ends respectively;
and determining target candidate words corresponding to any candidate server as system keywords of the candidate server, and obtaining the system keywords of each candidate server.
As a possible implementation manner, the processing component sequentially determines candidate service ends corresponding to the at least one candidate word, and the obtaining target candidate words corresponding to the plurality of candidate service ends respectively may specifically be:
And sequentially determining the candidate service ends respectively corresponding to the at least one candidate word according to the functional attributes respectively corresponding to the plurality of candidate service ends, and obtaining target candidate words respectively corresponding to the plurality of candidate service ends.
Further, optionally, the processing component sequentially determines, according to the functional attributes respectively corresponding to the plurality of candidate service ends, candidate service ends respectively corresponding to the at least one candidate word, and the obtaining, by the processing component, target candidate words respectively corresponding to the plurality of candidate service ends may specifically be:
and determining target candidate words matched with the functional attributes from the at least one candidate word aiming at the functional attributes of any candidate server to obtain target candidate words respectively corresponding to the plurality of candidate servers.
As yet another possible implementation, the processing component may determine the at least one candidate word by:
acquiring a history keyword corresponding to the history voice information;
determining target historical keywords meeting selection conditions from the historical keywords;
and determining the target historical keyword as the at least one candidate word.
Further, optionally, the determining, by the processing component, from the history keywords, a target history keyword that satisfies a selection condition may specifically be:
Determining the occurrence number of each history keyword;
and determining the historical keywords with the occurrence times larger than the frequency threshold as target historical keywords.
In some embodiments, the processing component determines the at least one candidate word by:
extracting voice attribute information in the voice information; determining user attribute information of the user based on the voice attribute information; and determining at least one candidate word with an association relation with the user attribute information.
As yet another possible implementation manner, the processing component may determine the system keywords of each of the plurality of candidate service ends by:
and detecting system keywords set by the user aiming at any candidate server to obtain the system keywords of each of the plurality of candidate servers.
As an embodiment, the processing component may also be configured to:
determining voice attribute information of the voice information;
the obtaining the recommended content corresponding to the voice information fed back by the target server, so as to output the recommended content for the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content based on the voice attribute information.
As a possible implementation manner, the processing component outputs the recommended content based on the voice attribute information may specifically be:
converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
In some embodiments, the converting, by the processing component, the recommended content into the first recommended voice corresponding to the voice attribute information may specifically be:
and if the user is located in the self-service place, converting the recommended content into first recommended voice corresponding to the voice attribute information.
As yet another possible implementation, the processing component may be further configured to:
determining user attribute information of the user based on the voice attribute information of the voice information;
the processing component may determine a plurality of candidate servers by:
and determining a plurality of candidate service ends with association relation with the user attribute information.
As yet another embodiment, the processing component may be further configured to:
determining sex attribute information of the voice information;
the obtaining the recommended content corresponding to the voice information fed back by the target server, so as to output the recommended content for the user includes:
Acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content to the user based on the gender attribute information.
As a possible implementation manner, the determining, by the processing component, sex attribute information of the voice information may specifically be:
sex attribute information of the voice information is determined based on the characteristic information of the voice information.
As yet another possible implementation manner, the determining, by the processing component, sex attribute information of the voice information may specifically be:
converting the voice information into text information;
carrying out semantic recognition processing on the text information to obtain name keywords in the text information;
and determining the sex attribute information of the voice information by using the name keyword.
In some embodiments, the determining, by the processing component using the name keyword, gender attribute information of the voice information may specifically be:
if the name keyword is matched with the first type of names, determining that the gender attribute information of the voice information is first attribute information;
and if the name keyword is matched with the second class of names, determining the gender attribute information of the voice information as second attribute information.
As a possible implementation manner, the outputting, by the processing component, the recommended content for the user based on the gender attribute information may specifically be:
generating a second recommended voice corresponding to the recommended content by taking the gender attribute information as a sound generation parameter;
and outputting the second recommended voice.
As an embodiment, the processing component may also be configured to:
and determining voice attribute information and gender attribute information of the voice information.
The processing component's outputting of the recommended content for the user may specifically be:
generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as sound generation parameters, and outputting the third recommended voice.
As an embodiment, the processing component's determining of keywords in the collected voice information may specifically be:
collecting voice information of the user;
and determining keywords in the acquired voice information.
As yet another embodiment, the determining, by the processing component, keywords in the collected voice information may specifically be:
acquiring voice information sent by electronic equipment, where the voice information is collected by the electronic equipment;
and determining keywords in the received voice information.
In certain embodiments, the processing component may further be configured to:
and sending the voice information to the target server.
The voice information is used by the target server to determine the corresponding recommended content.
The speech processing device shown in fig. 10 may perform the speech processing method described in the embodiments shown in fig. 1 to 4, and its implementation principle and technical effects are not repeated. The specific manner in which the processing components of the speech processing device of the above embodiments perform the operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
As shown in fig. 11, a schematic structural diagram of another embodiment of a speech processing device according to an embodiment of the present application may include:
a second determination module 1101, configured to determine keywords in the acquired voice information;
a second search module 1102, configured to search for a target client corresponding to the keywords from a plurality of candidate clients;
and a second processing module 1103, configured to acquire recommended content determined by the target client based on the voice information, so as to output the recommended content for a user.
In this embodiment, by integrating a plurality of candidate clients, a keyword in the voice information can be identified after the voice information of the user is collected, so that a target client corresponding to the keyword can be searched for from the plurality of candidate clients, and the recommended content determined based on the voice information can be obtained through the target client and output for the user. The electronic equipment integrates the plurality of candidate clients to realize their comprehensive application, so as to obtain the voice services respectively provided by the plurality of candidate clients. Each candidate client can communicate with its corresponding candidate server, and the electronic equipment obtains the background services of the multiple candidate servers by utilizing the multiple candidate clients, thereby improving the application efficiency of the electronic equipment.
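The keyword-based routing across integrated candidate clients can be sketched at a high level, assuming the voice information has already been transcribed to text. The client names, system keywords, and the word-level match rule are illustrative assumptions:

```python
# Each candidate client carries its own system keywords and exposes a
# recommendation service (here a placeholder).
class CandidateClient:
    def __init__(self, name, system_keywords):
        self.name = name
        self.system_keywords = set(system_keywords)

    def recommend(self, voice_text):
        # Placeholder for the client-specific recommendation service.
        return f"{self.name}: recommendation for '{voice_text}'"

def dispatch(voice_text, candidate_clients):
    """Search for the target client whose system keyword appears in the text."""
    words = set(voice_text.lower().split())  # crude stand-in for semantic recognition
    for client in candidate_clients:
        if client.system_keywords & words:
            return client.recommend(voice_text)
    return None  # no target client found

CLIENTS = [
    CandidateClient("music_client", {"music", "song"}),
    CandidateClient("weather_client", {"weather", "forecast"}),
]
```

The dispatcher hides which client handled the request, which is the "comprehensive application" effect described above.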
As an embodiment, the second determining module includes:
the first conversion unit is used for converting the voice information into text information;
the first processing unit is used for carrying out semantic recognition processing on the text information to obtain keywords in the text information.
As an embodiment, the second search module includes:
a sixth determining unit, configured to determine system keywords of each of the plurality of candidate clients;
a first matching unit, configured to search for target system keywords matched with the keywords from the system keywords of each of the plurality of candidate clients;
and the word determining unit is used for determining the candidate client corresponding to the target system keyword as the target client.
In certain embodiments, the sixth determination unit comprises:
and the third determining subunit is used for determining the client names of the candidate clients as the system keywords of the candidate clients.
In some embodiments, the sixth determining unit may include:
a fourth determining subunit configured to determine at least one candidate word;
a fourth obtaining subunit, configured to sequentially determine candidate clients corresponding to the at least one candidate word respectively, and obtain target candidate words corresponding to the plurality of candidate clients respectively;
and a fifth obtaining subunit, configured to confirm that a target candidate word corresponding to any one candidate client is a system keyword of the candidate client, and obtain the system keywords of each of the plurality of candidate clients.
As a possible implementation manner, the fourth obtaining subunit may include:
a first matching module, configured to sequentially determine the candidate clients respectively corresponding to the at least one candidate word according to the functional attributes respectively corresponding to the plurality of candidate clients, to obtain target candidate words respectively corresponding to the plurality of candidate clients.
As yet another possible implementation manner, the fourth determining subunit may include:
the second acquisition module is used for acquiring the history keywords corresponding to the history voice information;
the second selection module is used for determining target historical keywords meeting selection conditions from the historical keywords;
and the second target module is used for determining the target historical keywords as the at least one candidate word.
In some embodiments, the second selection module may include:
the second statistical unit is used for determining the occurrence times of each historical keyword;
and the word selecting unit is used for determining the historical keywords with the occurrence times larger than the frequency threshold as target historical keywords.
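The selection condition above can be sketched with a frequency count: history keywords whose occurrence count exceeds a threshold become the target history keywords (the candidate words). The sample history and the threshold value are illustrative assumptions:

```python
from collections import Counter

def select_target_history_keywords(history_keywords, count_threshold):
    """Return the history keywords that occur more often than the threshold."""
    counts = Counter(history_keywords)  # occurrence count per history keyword
    return {kw for kw, n in counts.items() if n > count_threshold}
```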
As yet another possible implementation manner, the fourth determining subunit may include:
the second extraction module is used for extracting voice attribute information in the voice information;
a seventh determining module, configured to determine user attribute information of the user based on the voice attribute information;
and a candidate determination module, configured to determine at least one candidate word having an association relationship with the user attribute information.
Further optionally, the first matching module may include:
and the second matching unit is used for determining target candidate words matched with the functional attributes from the at least one candidate word aiming at the functional attributes of any candidate client so as to obtain target candidate words respectively corresponding to the plurality of candidate clients.
As yet another possible implementation manner, the fourth determining subunit may include:
and the second detection subunit is used for detecting the system keywords set by the user for any candidate client so as to obtain the system keywords of each of the plurality of candidate clients.
As an embodiment, the apparatus may further include:
an eighth determining module, configured to determine voice attribute information of the voice information;
the second processing module may include:
the content acquisition unit is used for acquiring recommended content determined by the target client based on the voice information;
and a third output unit configured to output the recommended content based on the voice attribute information.
As a possible implementation manner, the third output unit may include:
a second conversion subunit, configured to convert the recommended content into a first recommended voice corresponding to the voice attribute information;
and a fourth output subunit, configured to output the first recommended voice.
Further optionally, the second converting subunit may specifically be configured to:
and if the user is located in the self-service place, converting the recommended content into first recommended voice corresponding to the voice attribute information.
In some embodiments, the apparatus may further comprise:
and a ninth determining module, configured to determine user attribute information of the user based on the voice attribute information of the voice information.
And a tenth determining module, configured to determine a plurality of candidate clients that have an association relationship with the user attribute information.
As an embodiment, the apparatus may further include:
the gender determination module is used for determining gender attribute information of the voice information;
the second processing module may include:
a fourth content acquisition unit, configured to acquire the recommended content determined by the target client based on the voice information;
and a fifth output unit, configured to output the recommended content for the user based on the gender attribute information.
As one possible implementation manner, the gender determination module may include:
a gender determining unit, configured to determine the gender attribute information of the voice information based on the characteristic information of the voice information.
As yet another possible implementation manner, the gender determination module may include:
a text conversion unit, configured to convert the voice information into text information;
a second name extraction unit, configured to perform semantic recognition processing on the text information to obtain name keywords in the text information;
and a tenth determining unit, configured to determine the gender attribute information of the voice information using the name keyword.
Further, optionally, the tenth determining unit may include:
a third matching subunit, configured to determine, if the name keyword matches the first type of names, that the gender attribute information of the voice information is the first attribute information;
and a fourth matching subunit, configured to determine, if the name keyword matches the second type of names, that the gender attribute information of the voice information is the second attribute information.
As yet another possible implementation manner, the fifth output unit may include:
a third generation subunit, configured to generate a second recommended voice corresponding to the recommended content, with the gender attribute information as a sound generation parameter;
and the third output subunit is used for outputting the second recommended voice.
As an embodiment, the second determining module may include:
the second acquisition unit is used for acquiring the voice information of the user;
and the eleventh determining unit is used for determining keywords in the acquired voice information.
As yet another embodiment, the second determining module may include:
a second acquisition unit, configured to acquire voice information sent by the electronic equipment, where the voice information is collected by the electronic equipment;
and a twelfth determining unit, configured to determine keywords in the received voice information.
In some possible designs, the apparatus may further comprise:
and the second sending module is used for sending the voice information to the target server.
The voice information is used by the target server to determine the corresponding recommended content.
In some embodiments, the fourth output unit may include:
a fourth generation subunit, configured to generate a second recommended voice corresponding to the recommended content by using the gender attribute information as a sound generation parameter;
and a fourth output subunit, configured to output the second recommended voice.
In some embodiments, the fourth output unit may further include:
a fifth generation subunit, configured to generate a third recommended voice corresponding to the recommended content by using the voice attribute information and the gender attribute information as sound generation parameters;
and a fifth output subunit, configured to output the third recommended voice.
The speech processing device shown in fig. 11 may perform the speech processing method described in the embodiments shown in fig. 7 to 8, and its implementation principle and technical effects are not repeated. The specific manner in which the individual modules, units, and sub-units of the speech processing apparatus in the above embodiments perform the operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
The speech processing apparatus shown in fig. 11 may be configured as a speech processing device, as shown in fig. 12, which is a schematic structural diagram of an embodiment of a speech processing device according to an embodiment of the present application, including: a storage component 1201 and a processing component 1202; the storage component 1201 stores one or more computer instructions that are invoked by the processing component 1202;
The processing component 1202 is configured to:
determining keywords in the acquired voice information; searching target clients corresponding to the keywords from the plurality of candidate clients; and acquiring recommended content determined by the target client based on the voice information, so as to output the recommended content for a user.
The electronic device may integrate multiple voice clients. The electronic device may provide voice interaction services to a user through a plurality of voice clients.
In this embodiment, by integrating a plurality of candidate clients, a keyword in the voice information can be identified after the voice information of the user is collected, so that a target client corresponding to the keyword can be searched from the plurality of candidate clients, and the recommended content determined by the voice information can be obtained through the target client, so that the recommended content can be output for the user. The electronic equipment integrates a plurality of candidate clients to realize comprehensive application of the plurality of candidate clients so as to obtain voice services respectively provided by the plurality of candidate clients. Each candidate client can communicate with the corresponding candidate server, and the electronic equipment obtains background services of the multiple candidate servers by utilizing the multiple candidate clients, so that the application efficiency of the electronic equipment is improved.
As an embodiment, the processing component's determining of keywords in the collected voice information may specifically be:
converting the voice information into text information;
and performing semantic recognition processing on the text information to obtain keywords in the text information.
As a possible implementation manner, the processing component's searching for the target client corresponding to the keywords from the plurality of candidate clients may specifically be:
determining system keywords of each of the plurality of candidate clients;
searching target system keywords matched with the keywords from the system keywords of each of the candidate clients;
and determining the candidate client corresponding to the target system keyword as the target client.
In some embodiments, the processing component's determining of the system keywords of each of the plurality of candidate clients may specifically be:
and determining the client names of the plurality of candidate clients as system keywords of the plurality of candidate clients.
In some embodiments, the processing component may determine the respective system keywords for the plurality of candidate clients by:
determining at least one candidate word;
sequentially determining candidate clients corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the plurality of candidate clients respectively;
and confirming that the target candidate word corresponding to any candidate client is the system keyword of the candidate client, and obtaining the system keywords of each candidate client.
As a possible implementation manner, the processing component's sequentially determining the candidate clients respectively corresponding to the at least one candidate word to obtain the target candidate words respectively corresponding to the plurality of candidate clients may specifically be:
and sequentially determining the candidate clients corresponding to the at least one candidate word according to the functional attributes corresponding to the candidate clients respectively to obtain target candidate words corresponding to the candidate clients respectively.
Further, optionally, the processing component's sequentially determining, according to the functional attributes respectively corresponding to the plurality of candidate clients, the candidate clients respectively corresponding to the at least one candidate word to obtain the target candidate words respectively corresponding to the plurality of candidate clients may specifically be:
and determining target candidate words matched with the functional attributes from the at least one candidate word aiming at the functional attributes of any candidate client so as to obtain target candidate words respectively corresponding to the plurality of candidate clients.
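The attribute-based matching above can be sketched as follows. The functional-attribute vocabulary and the word-to-attribute mapping are illustrative assumptions:

```python
# Hypothetical mapping from each candidate word to a functional attribute.
WORD_ATTRIBUTE = {
    "song": "audio",
    "playlist": "audio",
    "forecast": "weather",
}

def target_candidate_words(candidate_words, client_attributes):
    """For each client, keep the candidate words matching its functional attribute."""
    result = {client: [] for client in client_attributes}
    for word in candidate_words:
        for client, attribute in client_attributes.items():
            if WORD_ATTRIBUTE.get(word) == attribute:
                result[client].append(word)
    return result
```

Each client's resulting word list would then serve as its system keywords.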
In some embodiments, the processing component may determine the respective system keywords for the plurality of candidate clients by:
and detecting system keywords set by the user for any candidate client so as to obtain the system keywords of each of the plurality of candidate clients.
As an embodiment, the processing component may also be configured to:
determining voice attribute information of the voice information;
The processing component's acquiring of the recommended content determined by the target client based on the voice information, so as to output the recommended content for the user, may specifically be:
acquiring recommended content determined by the target client based on the voice information;
and outputting the recommended content based on the voice attribute information.
As a possible implementation, the processing component may be further configured to:
determining user attribute information of the user based on the voice attribute information of the voice information;
the processing component may determine a plurality of candidate clients by:
and determining a plurality of candidate clients which have association relations with the user attribute information.
As a possible implementation manner, the processing component outputs the recommended content based on the voice attribute information may specifically be:
converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
In some embodiments, the converting, by the processing component, the recommended content into the first recommended voice corresponding to the voice attribute information may specifically be:
and if the user is located in the self-service place, converting the recommended content into first recommended voice corresponding to the voice attribute information.
As yet another embodiment, the processing component may be further configured to:
determining gender attribute information of the voice information;
The processing component's acquiring of the recommended content determined by the target client based on the voice information, so as to output the recommended content for the user, may specifically be:
acquiring recommended content determined by the target client based on the voice information;
and outputting the recommended content to the user based on the gender attribute information.
In some embodiments, the processing component's determining of the gender attribute information of the voice information may specifically be:
determining the gender attribute information of the voice information based on the characteristic information of the voice information.
As a possible implementation manner, the processing component's determining of the gender attribute information of the voice information may specifically be:
converting the voice information into text information;
performing semantic recognition processing on the text information to obtain name keywords in the text information;
and determining the gender attribute information of the voice information by using the name keyword.
Further, optionally, the determining, by the processing component, the gender attribute information of the voice information using the name keyword may specifically be:
if the name keyword is matched with the first type of names, determining that the gender attribute information of the voice information is first attribute information;
and if the name keyword is matched with the second class of names, determining the gender attribute information of the voice information as second attribute information.
In some embodiments, the outputting, by the processing component, the recommended content for the user based on the gender attribute information may specifically be:
generating a second recommended voice corresponding to the recommended content by taking the gender attribute information as a sound generation parameter;
and outputting the second recommended voice.
As an embodiment, the processing component may also be configured to:
and determining voice attribute information and gender attribute information of the voice information.
The processing component's outputting of the recommended content for the user may specifically be:
generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as sound generation parameters, and outputting the third recommended voice.
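The assembly of sound generation parameters from the voice attribute information and the gender attribute information can be sketched as below. The parameter names and values are illustrative assumptions; a real system would hand the resulting parameters to its speech synthesis engine:

```python
def sound_generation_params(voice_attributes, gender_attribute):
    """Merge voice attributes (e.g. dialect, speed) with a gender-derived timbre."""
    params = dict(voice_attributes)
    # Hypothetical rule: the second attribute information selects a female timbre.
    params["timbre"] = (
        "female" if gender_attribute == "second_attribute_information" else "male"
    )
    return params

def generate_recommended_voice(recommended_content, params):
    # Placeholder for the actual text-to-speech step.
    return {"text": recommended_content, "params": params}
```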
The speech processing device shown in fig. 12 may perform the speech processing method described in the embodiments shown in fig. 7 to 8, and its implementation principle and technical effects are not repeated. The specific manner in which the processing components of the speech processing device of the above embodiments perform the operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
Fig. 13 is a schematic structural diagram of another embodiment of a speech processing apparatus according to an embodiment of the present application, configured in an electronic device, where the apparatus may include:
a voice acquisition module 1301, configured to collect voice information of a user;
a voice transmission module 1302, configured to send the voice information to a central server, where keywords in the voice information are determined by the central server and are used for searching for a target server corresponding to the keywords from a plurality of candidate servers; the target server is configured to feed back recommended content corresponding to the voice information to the central server; and the recommended content is fed back to the electronic equipment by the central server;
and a third processing module 1303, configured to acquire the recommended content fed back by the central server, so as to output the recommended content for the user.
In the embodiment of the application, the voice information sent by the electronic equipment is obtained, and after the keywords in the voice information are determined, the target server corresponding to the keywords is searched for from a plurality of candidate servers, so that recommended content corresponding to the voice information is obtained from the target server. By collecting the voice information of the user and exchanging data with the central server, the electronic equipment can obtain the network services provided by the plurality of candidate servers while remaining unaware of how those services are processed and switched, which expands the application range of the electronic equipment without affecting the user's use and improves the utilization rate of the electronic equipment.
The speech processing device shown in fig. 13 may perform the speech processing method described in the embodiment shown in fig. 5, and its implementation principle and technical effects are not repeated. The specific manner in which the individual modules, units, and sub-units of the speech processing apparatus in the above embodiments perform the operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
The speech processing apparatus shown in fig. 13 may be configured as an electronic device, as shown in fig. 14, which is a schematic structural diagram of an embodiment of an electronic device provided in an embodiment of the present application, where the device may include: a storage component 1401 and a processing component 1402; the storage component 1401 stores one or more computer instructions that are invoked by the processing component 1402;
the processing component 1402 may be configured to:
collecting voice information of a user; sending the voice information to a central server, where keywords in the voice information are determined by the central server and are used for searching for a target server corresponding to the keywords from a plurality of candidate servers; the target server is configured to feed back recommended content corresponding to the voice information to the central server; the recommended content is fed back to the electronic equipment by the central server; and acquiring the recommended content fed back by the central server, so as to output the recommended content for the user.
In the embodiment of the application, the voice information sent by the electronic equipment is obtained, and after the keywords in the voice information are determined, the target server corresponding to the keywords is searched for from a plurality of candidate servers, so that recommended content corresponding to the voice information is obtained from the target server. By collecting the voice information of the user and exchanging data with the central server, the electronic equipment can obtain the network services provided by the plurality of candidate servers while remaining unaware of how those services are processed and switched, which expands the application range of the electronic equipment without affecting the user's use and improves the utilization rate of the electronic equipment.
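The central-server flow described above can be sketched end to end: the device only talks to the central server, which extracts a keyword, routes the request to the matching candidate server, and relays the recommended content back. All names and the routing rule are illustrative assumptions:

```python
# Hypothetical candidate servers keyed by the keyword that routes to them;
# each returns its own recommended content.
CANDIDATE_SERVERS = {
    "music": lambda text: "a recommended playlist",
    "weather": lambda text: "a recommended forecast",
}

def central_server_handle(voice_text):
    """Route to the candidate server whose keyword appears in the text."""
    for keyword, server in CANDIDATE_SERVERS.items():
        if keyword in voice_text.lower():
            return server(voice_text)  # the target server's recommended content
    return None

def device_request(voice_text):
    # The electronic device is unaware of the routing and switching above.
    return central_server_handle(voice_text)
```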
The speech processing device shown in fig. 14 may perform the speech processing method described in the embodiment shown in fig. 5, and its implementation principles and technical effects are not repeated. The specific manner in which the processing components of the speech processing device of the above embodiments perform the operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, a specific working process executed by the processing component of the electronic device described above may refer to a corresponding process in the foregoing method embodiment, which is not described herein again.
The apparatus embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without creative effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (13)

1. A voice processing method, applied to a voice processing device, wherein a plurality of voice clients are installed on the voice processing device, and each of the plurality of voice clients is capable of providing voice services for a user, the method comprising:
determining keywords in the acquired voice information, wherein the keywords at least comprise keywords for distinguishing different voice clients;
searching the plurality of voice clients for a target client corresponding to the keywords for distinguishing different voice clients;
and acquiring recommended content determined by the target client based on the voice information, so as to output the recommended content to the user.
2. The method of claim 1, wherein determining the keywords in the acquired voice information comprises:
converting the voice information into text information;
and performing semantic recognition on the text information to obtain the keywords in the text information.
3. The method of claim 1, wherein searching the plurality of voice clients for the target client corresponding to the keywords for distinguishing different voice clients comprises:
determining system keywords of each of the plurality of voice clients;
searching the system keywords of each of the plurality of voice clients for a target system keyword matching the keywords for distinguishing different voice clients;
and determining the voice client corresponding to the target system keyword as the target client.
4. The method of claim 3, wherein determining the system keywords of each of the plurality of voice clients comprises:
determining the client name of each of the plurality of voice clients as a system keyword of that voice client.
5. The method of claim 3, wherein the system keywords for each of the plurality of voice clients are determined by:
determining at least one candidate word;
sequentially determining the voice client corresponding to each of the at least one candidate word, so as to obtain target candidate words respectively corresponding to the plurality of voice clients;
and determining the target candidate word corresponding to any voice client as a system keyword of that voice client, so as to obtain the system keywords of each voice client.
6. The method of claim 5, wherein sequentially determining the voice clients respectively corresponding to the at least one candidate word, and obtaining the target candidate words respectively corresponding to the plurality of voice clients, comprises:
sequentially determining, according to the functional attributes respectively corresponding to the plurality of voice clients, the voice clients respectively corresponding to the at least one candidate word, so as to obtain the target candidate words respectively corresponding to the plurality of voice clients.
7. The method of claim 6, wherein sequentially determining, according to the functional attributes respectively corresponding to the plurality of voice clients, the voice clients respectively corresponding to the at least one candidate word, and obtaining the target candidate words respectively corresponding to the plurality of voice clients, comprises:
determining, for the functional attribute of any voice client, a target candidate word matching the functional attribute from the at least one candidate word, so as to obtain the target candidate words respectively corresponding to the plurality of voice clients.
8. The method of claim 3, wherein the system keywords for each of the plurality of voice clients are determined by:
detecting system keywords set by the user for any voice client, so as to obtain the system keywords of each voice client.
9. The method as recited in claim 1, further comprising: determining voice attribute information of the voice information;
wherein outputting the recommended content for the user comprises: outputting the recommended content based on the voice attribute information.
10. The method as recited in claim 9, further comprising:
determining user attribute information of the user based on the voice attribute information of the voice information;
the plurality of voice clients are determined by:
determining a plurality of voice clients having an association relationship with the user attribute information.
11. The method as recited in claim 1, further comprising: determining gender attribute information of the voice information;
wherein outputting the recommended content for the user comprises: outputting the recommended content to the user based on the gender attribute information.
12. A voice processing apparatus, applied to a voice processing device, wherein a plurality of voice clients are installed on the voice processing device, and each of the plurality of voice clients is capable of providing voice services for a user, the apparatus comprising:
the second determining module is used for determining keywords in the acquired voice information, and the keywords at least comprise keywords used for distinguishing different voice clients;
the second searching module is used for searching target clients corresponding to the keywords for distinguishing different voice clients from the plurality of voice clients;
and the second processing module is used for acquiring the recommended content determined by the target client based on the voice information, so as to output the recommended content for the user.
13. A voice processing device having a plurality of voice clients installed thereon, each of the plurality of voice clients being capable of providing voice services to a user, the voice processing device comprising: a storage component and a processing component; wherein the storage component stores one or more computer instructions that are invoked by the processing component;
the processing assembly is configured to:
determining keywords in the acquired voice information, wherein the keywords at least comprise keywords for distinguishing different voice clients; searching target clients corresponding to the keywords for distinguishing different voice clients from the plurality of voice clients; and acquiring recommended content determined by the target client based on the voice information, so as to output the recommended content for a user.
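The dispatch flow recited in claims 1 and 3 (extract keywords from recognized speech, match them against each client's system keywords, and fetch recommended content from the matched client) can be sketched as follows. This is an illustrative sketch only, not the patented implementation; the client names, the system keywords, and the `recommend` interface are hypothetical stand-ins.

```python
# Illustrative sketch of the claimed dispatch flow (claims 1 and 3).
# All names and keywords below are hypothetical examples.

class VoiceClient:
    """A voice client registered on the device, identified by its system keywords."""

    def __init__(self, name, system_keywords):
        self.name = name
        self.system_keywords = set(system_keywords)

    def recommend(self, text):
        # Placeholder: a real client would generate recommended content
        # from the voice query.
        return f"{self.name} recommendation for: {text}"


def extract_keywords(text):
    """Stand-in for speech-to-text plus semantic recognition (claim 2)."""
    return text.lower().split()


def find_target_client(keywords, clients):
    """Match the extracted keywords against each client's system keywords (claim 3)."""
    for client in clients:
        if client.system_keywords & set(keywords):
            return client
    return None  # no client claims these keywords


clients = [
    VoiceClient("music", ["music", "song", "play"]),
    VoiceClient("weather", ["weather", "forecast"]),
]

text = "play a song by Queen"  # assumed output of speech recognition
target = find_target_client(extract_keywords(text), clients)
if target is not None:
    # Output the recommended content to the user.
    print(target.recommend(text))
```

In this sketch the first client whose system-keyword set intersects the extracted keywords wins; the claims themselves do not fix a tie-breaking rule, so that choice is an assumption of the example.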
CN201911371210.0A 2019-12-26 2019-12-26 Voice processing method and device, voice processing equipment and electronic equipment Active CN111354350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911371210.0A CN111354350B (en) 2019-12-26 2019-12-26 Voice processing method and device, voice processing equipment and electronic equipment


Publications (2)

Publication Number Publication Date
CN111354350A CN111354350A (en) 2020-06-30
CN111354350B true CN111354350B (en) 2024-04-05

Family

ID=71197007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911371210.0A Active CN111354350B (en) 2019-12-26 2019-12-26 Voice processing method and device, voice processing equipment and electronic equipment

Country Status (1)

Country Link
CN (1) CN111354350B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010773A (en) * 2021-02-22 2021-06-22 东风小康汽车有限公司重庆分公司 Information pushing method and equipment
CN115599890B (en) * 2022-11-29 2023-03-21 深圳市人马互动科技有限公司 Product recommendation method and related device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007199315A (en) * 2006-01-25 2007-08-09 Ntt Software Corp Content providing apparatus
CN105137789A (en) * 2015-08-28 2015-12-09 青岛海尔科技有限公司 Control method and device of intelligent IoT electrical appliances, and related devices
CN105654950A (en) * 2016-01-28 2016-06-08 百度在线网络技术(北京)有限公司 Self-adaptive voice feedback method and device
WO2017080400A1 (en) * 2015-11-13 2017-05-18 阿里巴巴集团控股有限公司 Information recommendation method and device
CN106847269A (en) * 2017-01-20 2017-06-13 浙江小尤鱼智能技术有限公司 The sound control method and device of a kind of intelligent domestic system
CN107170449A (en) * 2017-06-14 2017-09-15 上海雍敏信息科技有限公司 Intelligent domestic system and its control method
CN107948698A (en) * 2017-12-14 2018-04-20 深圳市雷鸟信息科技有限公司 Sound control method, system and the smart television of smart television
CN108304153A (en) * 2017-03-02 2018-07-20 腾讯科技(深圳)有限公司 Voice interactive method and device
WO2018157721A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Method for acquiring and providing information, device, system and storage medium
CN108540677A (en) * 2017-03-05 2018-09-14 北京智驾互联信息服务有限公司 Method of speech processing and system
CN109325097A (en) * 2018-07-13 2019-02-12 海信集团有限公司 A kind of voice guide method and device, electronic equipment, storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant