CN109658931B - Voice interaction method, device, computer equipment and storage medium - Google Patents

Voice interaction method, device, computer equipment and storage medium

Info

Publication number
CN109658931B
Authority
CN
China
Prior art keywords
text
voice
input
terminal
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811554298.5A
Other languages
Chinese (zh)
Other versions
CN109658931A (en)
Inventor
黄泽浩
章锦涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811554298.5A
Publication of CN109658931A
Application granted
Publication of CN109658931B
Legal status: Active


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/26 — Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention provides a voice interaction method, a voice interaction device, computer equipment and a storage medium. The method is applied to the field of voice interaction and comprises the steps of: acquiring input voice through a first terminal, and performing voice recognition processing on the input voice to obtain an input text; traversing all text combinations in a preset text library to obtain a text combination matched with the input text; obtaining a language type corresponding to a second terminal, and generating an output text according to the language type corresponding to the second terminal; and generating output voice according to the output text and the language type corresponding to the second terminal, and sending the output voice to the second terminal. By implementing the embodiment of the invention, the problem of language barriers in voice interaction can be solved, and voice interaction is made more engaging.

Description

Voice interaction method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer data processing, and in particular, to a voice interaction method, a device, a computer device, and a computer readable storage medium.
Background
With the development of Internet technology, online games are becoming more and more popular. In online games, players often need to interact with other players, for example by exchanging text through input boxes or by speaking through real-time voice calls. In some large online games, however, players may be distributed across countries and regions all over the world. Players in different countries or regions cannot understand each other's text messages or input voice because they do not share a language, which renders the communication features of the game application effectively useless.
Disclosure of Invention
The embodiment of the invention provides a voice interaction method, a voice interaction device, computer equipment and a storage medium, and aims to solve the problem of cross-language communication in voice interaction.
In a first aspect, an embodiment of the present invention provides a voice interaction method, including: acquiring input voice through a first terminal, and performing voice recognition processing on the input voice to obtain an input text; traversing all text combinations in a preset text library to obtain text combinations matched with the input text; obtaining a language type corresponding to a second terminal, and generating an output text according to the language type corresponding to the second terminal; and generating output voice according to the output text and the language type corresponding to the second terminal, and sending the output voice to the second terminal.
In a second aspect, an embodiment of the present invention provides a voice interaction device, including:
the first acquisition unit is used for acquiring input voice through the first terminal and performing voice recognition processing on the input voice to obtain an input text;
the second acquisition unit is used for traversing all text combinations in a preset text library and obtaining text combinations matched with the input text;
the third acquisition unit is used for acquiring the language type corresponding to the second terminal and generating an output text according to the language type corresponding to the second terminal;
the first generation unit is used for generating output voice according to the output text and the language type corresponding to the second terminal, and sending the output voice to the second terminal.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the voice interaction method described above when executing the program.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium, where the computer readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the above-described voice interaction method.
The embodiment of the invention provides a voice interaction method, a voice interaction device, computer equipment and a computer readable storage medium. The method comprises the steps of: acquiring input voice through a first terminal, and performing voice recognition processing on the input voice to obtain an input text; traversing all text combinations in a preset text library to obtain a text combination matched with the input text; obtaining a language type corresponding to a second terminal, and generating an output text according to the language type corresponding to the second terminal; and generating output voice according to the output text and the language type corresponding to the second terminal, and sending the output voice to the second terminal. By implementing the embodiment of the invention, the problem of language barriers in voice interaction can be solved, and voice interaction is made more engaging.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention; for a person skilled in the art, other drawings may be obtained from these drawings without inventive effort.
FIG. 1 is a flowchart of a voice interaction method according to an embodiment of the present invention;
Fig. 2 is a schematic application scenario diagram of a voice interaction method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a voice interaction method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a voice interaction method according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a voice interaction method according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a voice interaction method according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a voice interaction device according to an embodiment of the present invention;
FIG. 8 is another schematic block diagram of a voice interaction device according to an embodiment of the present invention;
FIG. 9 is another schematic block diagram of a voice interaction device according to an embodiment of the present invention;
FIG. 10 is another schematic block diagram of a voice interaction device according to an embodiment of the present invention;
FIG. 11 is another schematic block diagram of a voice interaction device according to an embodiment of the present invention;
fig. 12 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
Fig. 1 and fig. 2 are a schematic flow chart and an application scenario diagram of a voice interaction method according to an embodiment of the invention. As shown in fig. 2, the voice interaction method is applied to the server 20, and the server 20 may be an independent server or a server cluster formed by a plurality of servers. The server 20 may be communicatively connected to the terminal 10 via network communication to implement data interaction. The number of the terminals 10 may be plural, for example, the terminals 10 include a first terminal and a second terminal, and the first terminal and the second terminal may be connected through a server. The terminal 10 may be an electronic device having a communication function such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a personal digital assistant, and a wearable device.
Wherein the voice interaction method includes, but is not limited to, steps S110-S150.
S110, acquiring input voice through the first terminal, and performing voice recognition processing on the input voice to obtain input text.
The voice recognition processing on the input voice to obtain the input text can be implemented by calling a speech recognition tool at the server side. Such tools include, but are not limited to, speech recognition tools based on HMM and N-gram models, such as CMU Sphinx, Kaldi, HTK, Julius, and ISIP.
In some embodiments, as shown in FIG. 3, step S110 includes steps S111-S112.
S111, judging whether the duration of the input voice is greater than a preset time threshold.
In particular, the voice interaction may be voice interaction during a game. To reduce unnecessary user operations during the game, automatic collection of input voice can be enabled; that is, voice produced during the game is automatically collected as input voice by the voice collection device of the first terminal, without any user operation. However, not all input voice a user generates during a game is interactive input voice: the user may also be conversing with other people in the real world while playing. To reduce the amount of input voice the server side must process, a time threshold is set to screen the received input voice. The preset time threshold can be set according to the processing pressure of the server; the shorter the preset time threshold, the smaller the processing pressure on the server. For example, the preset time threshold is 3 seconds.
And S112, if the duration of the input voice is not greater than a preset time threshold, performing voice recognition processing on the input voice to obtain an input text.
Specifically, if the duration of the input voice is not greater than the preset time threshold, this indicates that the input voice may be interactive input voice, and the input voice is then subjected to voice recognition processing to obtain an input text.
If the duration of the input voice is greater than the preset time threshold, the input voice is too long and may not be interactive input voice; the input voice is therefore deleted and no voice recognition processing is performed on it, which reduces the processing pressure on the server side and improves the smoothness of the game.
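As a concrete illustration of steps S111 and S112, the following minimal Python sketch (not part of the patent) discards clips longer than the 3-second threshold and recognizes the rest with CMU Sphinx via the third-party SpeechRecognition and pocketsphinx packages; the file-path handling and function names are hypothetical.

```python
import wave
import speech_recognition as sr  # pip install SpeechRecognition pocketsphinx

TIME_THRESHOLD = 3.0  # the preset time threshold of the embodiment, in seconds

def wav_duration(path):
    """Duration of a PCM WAV clip in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def recognize_input_speech(path):
    """Steps S111-S112: discard clips longer than the threshold; otherwise
    run speech recognition and return the input text."""
    if wav_duration(path) > TIME_THRESHOLD:
        return None  # likely a real-world conversation, not interactive speech
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)
    # CMU Sphinx is one of the HMM-based recognizers named in the text
    return recognizer.recognize_sphinx(audio)
```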
S120, traversing all text combinations in a preset text library to obtain text combinations matched with the input text.
Specifically, the preset text library is used for storing text combinations, and the number of the text combinations can be one or more. The text combination is used for storing a plurality of interaction keywords with the same semantics and different interaction languages, and each interaction keyword corresponds to a unique interaction language.
As shown in Table 1, the preset text library may include a plurality of text combinations, such as a first text combination, a second text combination, and so on. Each text combination may include a plurality of interaction keywords, such as "retreat" or "rescue me", and each interaction keyword corresponds to a unique interaction language, such as Chinese or English.
Interaction language | Chinese | English | Japanese | ……
First text combination | 撤退 ("retreat") | retreat | …… | ……
Second text combination | 救我 ("rescue me") | help me | …… | ……
…… | …… | …… | …… | ……
TABLE 1
In some embodiments, as shown in FIG. 4, step S120 includes steps S121-S123.
S121, word segmentation processing is carried out on the input text to generate text keywords.
Specifically, the word segmentation processing on the input text may be implemented by calling the jieba tool. Assuming the input text is "快来救我" ("come quickly and rescue me"), the text keywords obtained by segmenting the input text are "快来" ("come quickly") and "救我" ("rescue me").
S122, traversing all text combinations in a preset text library to obtain interaction keywords identical to the text keywords.
Specifically, the text keywords are subjected to character comparison with the interactive keywords in the text combination, so that the interactive keywords identical to the text keywords are obtained.
And S123, determining the text combination where the interaction keyword identical to the text keyword is located as the text combination matched with the input text.
Specifically, assuming that the text keyword is "rescue me", by traversing all text combinations in a preset text library, it may be determined that the text combination in which the same interaction keyword as the text keyword is located is a second text combination, and further the second text combination is determined as a text combination matched with the input text.
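The following Python sketch (again, not from the patent) illustrates steps S121 to S123 with jieba and a small in-memory stand-in for the preset text library; the library contents are reconstructed from Table 1, the Japanese column is omitted, and the storage format and language codes are assumptions.

```python
import jieba  # the word segmentation tool named above; pip install jieba

# Minimal stand-in for the preset text library of Table 1: each text
# combination maps an interaction-language code to its interaction keyword.
TEXT_LIBRARY = [
    {"ZH": "撤退", "EN": "retreat"},   # first text combination
    {"ZH": "救我", "EN": "help me"},   # second text combination
]

def match_text_combination(input_text):
    """Steps S121-S123: segment the input text into text keywords, then
    traverse all text combinations for an interaction keyword equal to one
    of the text keywords."""
    text_keywords = jieba.lcut(input_text)  # e.g. "快来救我" -> ["快来", "救我"]
    for combination in TEXT_LIBRARY:
        if any(kw in combination.values() for kw in text_keywords):
            return combination
    return None  # no text combination matches the input text

print(match_text_combination("快来救我"))  # expected: the second combination
```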
S130, obtaining the language type corresponding to the second terminal, and generating an output text according to the language type corresponding to the second terminal.
In some embodiments, as shown in FIG. 5, step S130 includes steps S131-S133.
S131, obtaining the type of the system language currently used by the second terminal, and taking the system language currently used by the second terminal as the type of the language corresponding to the second terminal.
Specifically, the system language currently used by the second terminal may be obtained by sending a system language acquisition instruction to the second terminal and reading the output result returned for that instruction, so as to determine the type of the system language currently used by the second terminal. The system language acquisition instruction mainly comprises getSystemLanguageList(); if the returned output result is "EN", it is determined that the system language currently used by the second terminal is English.
S132, acquiring the interactive languages with the same type as the second terminal language from a preset text library.
Specifically, each text combination may include a plurality of interaction keywords having the same semantics but different interaction languages, with the interaction languages corresponding to the interaction keywords one by one. For example, the second text combination includes three interaction keywords sharing the semantics of "rescue me", each corresponding to one interaction language: Chinese, English, and Japanese respectively. The obtained language type corresponding to the second terminal is compared with the interaction languages in the text combination matched with the input text, so as to obtain the interaction language identical to the language type of the second terminal.
S133, generating an output text according to the interactive languages with the same type as the second terminal language and a preset text library.
Specifically, if the obtained interaction language of the same type as the second terminal language is English, the interaction keyword whose interaction language is English is obtained from the text combination matched with the input text and taken as the output text.
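Continuing the sketch above, steps S131 to S133 reduce to selecting, from the matched text combination, the keyword stored under the second terminal's language type; the language codes are assumptions carried over from the previous sketch.

```python
def generate_output_text(combination, terminal_language):
    """Steps S131-S133: select the interaction keyword whose interaction
    language matches the second terminal's language type (e.g. the "EN"
    result of the assumed getSystemLanguageList() instruction)."""
    return combination.get(terminal_language.upper())

# e.g. generate_output_text({"ZH": "救我", "EN": "help me"}, "EN") -> "help me"
```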
And S140, generating output voice according to the output text and the language type corresponding to the second terminal, and sending the output voice to the second terminal.
In particular, the generation of output speech may be implemented by TTS (Text-To-Speech) technology, which synthesizes speech from text. The output voice is then sent to the second terminal, thereby implementing cross-language voice interaction.
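The patent does not name a TTS engine, so as one possible realization of this step the sketch below uses the third-party gTTS package; the function name and output path are hypothetical.

```python
from gtts import gTTS  # third-party TTS library; not named in the patent

def synthesize_output_speech(output_text, terminal_language, out_path):
    """Step S140: synthesize the output text in the language type of the
    second terminal and save the audio for sending to that terminal."""
    gTTS(text=output_text, lang=terminal_language.lower()).save(out_path)

synthesize_output_speech("help me", "EN", "output.mp3")
```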
In some embodiments, as shown in FIG. 6, step S140 includes steps S141-S143.
S141, acquiring the fundamental tone frequency of the input voice, and judging whether the fundamental tone frequency is larger than a preset frequency threshold value.
Specifically, the pitch frequency refers to the quasi-periodic excitation pulse train produced when a person voices, as airflow passing through the glottis drives the vocal cords into relaxation-oscillation vibration. The pitch frequency is closely related to the length, thickness, and tension of the vocal cords, so the gender of the user corresponding to the input voice can be obtained by analyzing the pitch frequency of the input voice.
S142, if the pitch frequency is larger than a preset frequency threshold, determining that the input voice is female voice input voice, generating female voice output voice according to the output text and the language type corresponding to the second terminal, and sending the female voice output voice to the second terminal.
Specifically, the pitch frequency of male voices is low, generally 50-200 Hz, while the pitch frequency of female voices is higher, generally 180-500 Hz. Based on this difference between male and female pitch frequencies, the preset frequency threshold may be set within the range of 180-200 Hz, for example at 190 Hz. If the pitch frequency is greater than 190 Hz, the input voice is determined to be female voice input voice, and female voice output voice is generated accordingly.
S143, if the pitch frequency is not greater than a preset frequency threshold, determining that the input voice is male voice input voice, generating male voice output voice according to the output text and the language type corresponding to the second terminal, and sending the male voice output voice to the second terminal.
Specifically, if the pitch frequency is not greater than 190Hz, the input speech is determined to be male input speech, thereby generating male output speech.
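The embodiment fixes only the 190 Hz threshold and leaves the pitch estimation method open; the autocorrelation estimate below is therefore just one plausible choice, sketched with NumPy.

```python
import numpy as np

FREQ_THRESHOLD = 190.0  # preset frequency threshold of the embodiment, in Hz

def estimate_pitch(samples, sample_rate):
    """Rough autocorrelation-based pitch estimate over one mono clip."""
    samples = samples - samples.mean()
    corr = np.correlate(samples, samples, mode="full")[len(samples) - 1:]
    # Search only lags corresponding to 50-500 Hz, the human range cited above.
    lo, hi = int(sample_rate / 500), int(sample_rate / 50)
    lag = lo + int(np.argmax(corr[lo:hi]))
    return sample_rate / lag

def output_voice_gender(samples, sample_rate):
    """Steps S141-S143: pitch above the threshold -> female voice output
    speech, otherwise male voice output speech."""
    return "female" if estimate_pitch(samples, sample_rate) > FREQ_THRESHOLD else "male"
```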
By implementing the embodiment of the invention, the gender of the user corresponding to the input voice is obtained by analyzing the input voice, so as to determine the gender type of the output voice. This makes the voice interaction more realistic and more engaging.
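Tying the sketches together, the full pipeline of steps S110 to S140 might be driven as follows; every helper here is one of the hypothetical functions defined in the earlier sketches, and "EN" stands in for the language type obtained from the second terminal.

```python
text = recognize_input_speech("input.wav")                  # S110
if text is not None:
    combination = match_text_combination(text)              # S120
    if combination is not None:
        out_text = generate_output_text(combination, "EN")  # S130
        if out_text is not None:
            synthesize_output_speech(out_text, "EN", "out.mp3")  # S140
            # "out.mp3" would then be sent to the second terminal
```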
Fig. 7 is a schematic block diagram of a voice interaction device 100 according to an embodiment of the present invention. As shown in fig. 7, the present invention further provides a voice interaction device 100 corresponding to the above voice interaction method. The voice interaction device 100 includes a unit for executing the voice interaction method, where the device 100 may be configured in a server, and the server may be an independent server or a server cluster formed by a plurality of servers.
Specifically, referring to fig. 7, the voice interaction device 100 includes a first obtaining unit 110, a second obtaining unit 120, a third obtaining unit 130, and a first generating unit 140.
The first obtaining unit 110 is configured to obtain an input voice through the first terminal, and perform a voice recognition process on the input voice to obtain an input text.
In some embodiments, as shown in fig. 8, the first obtaining unit 110 includes a first judging unit 111 and a first processing unit 112.
A first judging unit 111, configured to judge whether a duration of the input voice is greater than a preset time threshold.
The first processing unit 112 is configured to perform a speech recognition process on the input speech to obtain an input text if the duration of the input speech is not greater than a preset time threshold.
And the second obtaining unit 120 is configured to traverse all text combinations in a preset text library to obtain text combinations matched with the input text.
In some embodiments, as shown in fig. 9, the second obtaining unit 120 includes a second generating unit 121, a fourth obtaining unit 122, and a first determining unit 123.
The second generating unit 121 is configured to perform word segmentation processing on the input text to generate text keywords.
And a fourth obtaining unit 122, configured to traverse all text combinations in the preset text library, and obtain the interaction keywords that are the same as the text keywords.
A first determining unit 123, configured to determine a text combination where the same interaction keyword as the text keyword is located as a text combination matching the input text.
And the third obtaining unit 130 is configured to obtain a language type corresponding to the second terminal, and generate an output text according to the language type corresponding to the second terminal.
In some embodiments, as shown in fig. 10, the third acquiring unit 130 includes a fifth acquiring unit 131, a sixth acquiring unit 132, and a third generating unit 133.
A fifth obtaining unit 131, configured to obtain a system language type currently used by the second terminal, and use the system language currently used by the second terminal as a language type corresponding to the second terminal.
A sixth obtaining unit 132, configured to obtain, in a preset text library, an interactive language that is the same as the second terminal language.
And a third generating unit 133, configured to generate an output text according to the interactive language of the same type as the second terminal language and a preset text library.
The first generating unit 140 is configured to generate an output voice according to the output text and the language type corresponding to the second terminal, and send the output voice to the second terminal.
In some embodiments, as shown in fig. 11, the first generating unit 140 includes a seventh acquiring unit 141, a second processing unit 142, and a third processing unit 143.
A seventh obtaining unit 141, configured to obtain a pitch frequency of the input speech, and determine whether the pitch frequency is greater than a preset frequency threshold.
And the second processing unit 142 is configured to determine that the input speech is female voice input speech if the pitch frequency is greater than a preset frequency threshold, generate female voice output speech according to the output text and a language type corresponding to the second terminal, and send the female voice output speech to the second terminal.
And a third processing unit 143, configured to determine that the input speech is male input speech if the pitch frequency is not greater than a preset frequency threshold, generate male output speech according to the output text and a language type corresponding to the second terminal, and send the male output speech to the second terminal.
It should be noted that, as will be clearly understood by those skilled in the art, the specific implementation process of the voice interaction device 100 and each unit may refer to the corresponding description in the foregoing method embodiments, and for convenience and brevity of description, the description is omitted here.
The apparatus 100 described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 12.
Referring to fig. 12, fig. 12 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 may be a terminal. The terminal can be electronic equipment with communication functions, such as a smart phone, a tablet personal computer, a notebook computer, a desktop computer, a personal digital assistant, a wearable equipment and the like.
The computer device 500 includes a processor 520, a memory, and a network interface 550 connected by a system bus 510, wherein the memory may include a non-volatile storage medium 530 and an internal memory 540.
The non-volatile storage medium 530 may store an operating system 531 and computer programs 532. The computer program 532, when executed, may cause the processor 520 to perform a voice interaction method.
The processor 520 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 540 provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by the processor 520, causes the processor 520 to perform a method of voice interaction.
The network interface 550 is used for network communication with other devices. It will be appreciated by those skilled in the art that the schematic block diagram of the computer device is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device 500 to which the present inventive arrangements are applied, and that a particular computer device 500 may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Wherein the processor 520 is configured to execute program code stored in the memory to perform the following functions: acquiring input voice through a first terminal, and performing voice recognition processing on the input voice to obtain an input text; traversing all text combinations in a preset text library to obtain text combinations matched with the input text; obtaining a language type corresponding to a second terminal, and generating an output text according to the language type corresponding to the second terminal; and generating output voice according to the output text and the language type corresponding to the second terminal, and sending the output voice to the second terminal.
In one embodiment, when the step of obtaining the input voice through the first terminal and performing the voice recognition processing on the input voice to obtain the input text is performed by the processor 520, the following steps are specifically performed: judging whether the duration of the input voice is greater than a preset time threshold; and if the duration of the input voice is not greater than a preset time threshold, performing voice recognition processing on the input voice to obtain an input text.
In one embodiment, when the processor 520 performs the step of traversing all text combinations in the preset text library to obtain a text combination matching the input text, the following steps are specifically performed: word segmentation processing is carried out on the input text so as to generate text keywords; traversing all text combinations in a preset text library to obtain interaction keywords which are the same as the text keywords; and determining the text combination where the interaction keyword identical to the text keyword is located as the text combination matched with the input text.
In one embodiment, when executing the step of obtaining the language type corresponding to the second terminal and generating the output text according to the language type corresponding to the second terminal, the processor 520 specifically executes the following steps: acquiring a system language type currently used by a second terminal, and taking the system language currently used by the second terminal as a language type corresponding to the second terminal; acquiring interactive languages with the same type as the second terminal language from a preset text library; and generating an output text according to the interactive languages with the same type as the second terminal language and a preset text library.
In one embodiment, when executing the step of generating an output voice according to the output text and the language type corresponding to the second terminal and sending the output voice to the second terminal, the processor 520 specifically executes the following steps: acquiring the fundamental tone frequency of the input voice, and judging whether the fundamental tone frequency is larger than a preset frequency threshold value; if the pitch frequency is larger than a preset frequency threshold, determining that the input voice is female voice input voice, generating female voice output voice according to the output text and the language type corresponding to the second terminal, and sending the female voice output voice to the second terminal; if the pitch frequency is not greater than a preset frequency threshold, determining that the input voice is male voice input voice, generating male voice output voice according to the output text and the language type corresponding to the second terminal, and sending the male voice output voice to the second terminal.
It should be appreciated that in embodiments of the present invention, the processor 520 may be a central processing unit (CPU); the processor 520 may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or any conventional processor.
It will be appreciated by those skilled in the art that the schematic block diagram of the computer device 500 does not constitute a limitation of the computer device 500, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
In another embodiment of the present invention, a computer-readable storage medium storing a computer program is provided, wherein the computer program includes program instructions. The program instructions, when executed by a processor, implement the steps of: acquiring input voice through a first terminal, and performing voice recognition processing on the input voice to obtain an input text; traversing all text combinations in a preset text library to obtain text combinations matched with the input text; obtaining a language type corresponding to a second terminal, and generating an output text according to the language type corresponding to the second terminal; and generating output voice according to the output text and the language type corresponding to the second terminal, and sending the output voice to the second terminal.
In one embodiment, when the program instructions are executed by the processor to implement the step of obtaining an input voice through the first terminal and performing a voice recognition process on the input voice to obtain an input text, the following steps are specifically implemented: judging whether the duration of the input voice is greater than a preset time threshold; and if the duration of the input voice is not greater than a preset time threshold, performing voice recognition processing on the input voice to obtain an input text.
In an embodiment, when the program instructions are executed by the processor to implement the step of traversing all text combinations in the preset text library to obtain a text combination matching the input text, the following steps are specifically implemented: word segmentation processing is carried out on the input text so as to generate text keywords; traversing all text combinations in a preset text library to obtain interaction keywords which are the same as the text keywords; and determining the text combination where the interaction keyword identical to the text keyword is located as the text combination matched with the input text.
In an embodiment, when the program instructions are executed by the processor to implement the step of obtaining the language type corresponding to the second terminal and generating the output text according to the language type corresponding to the second terminal, the following steps are specifically implemented: acquiring a system language type currently used by a second terminal, and taking the system language currently used by the second terminal as a language type corresponding to the second terminal; acquiring interactive languages with the same type as the second terminal language from a preset text library; and generating an output text according to the interactive languages with the same type as the second terminal language and a preset text library.
In an embodiment, when the program instructions are executed by the processor to implement the step of generating output voice according to the output text and the language type corresponding to the second terminal, and sending the output voice to the second terminal, the following steps are specifically implemented: acquiring the fundamental tone frequency of the input voice, and judging whether the fundamental tone frequency is larger than a preset frequency threshold value; if the pitch frequency is larger than a preset frequency threshold, determining that the input voice is female voice input voice, generating female voice output voice according to the output text and the language type corresponding to the second terminal, and sending the female voice output voice to the second terminal; if the pitch frequency is not greater than a preset frequency threshold, determining that the input voice is male voice input voice, generating male voice output voice according to the output text and the language type corresponding to the second terminal, and sending the male voice output voice to the second terminal.
The computer readable storage medium may be a usb disk, a removable hard disk, a Read-only memory (ROM), a magnetic disk, or an optical disk, etc. which may store the program code.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. To clearly illustrate the interchangeability of hardware and software, the elements and steps of the examples have been described above generally in terms of their functions. Whether such functions are implemented as hardware or software depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention. It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the apparatus and units described above may refer to the corresponding processes in the foregoing method embodiments and are not described again here.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. The device embodiments described above are merely illustrative: the division of the units is only a logical function division, and there may be other division manners in actual implementation; more than one unit or component may be combined or integrated into another system, or some features may be omitted or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (7)

1. A method of voice interaction, the method comprising:
automatically acquiring input voice through a first terminal, and judging whether the duration of the input voice is greater than a preset time threshold; if the duration of the input voice is not greater than the preset time threshold, indicating that the input voice belongs to interactive input voice, performing voice recognition processing on the input voice to obtain an input text; if the duration of the input voice is greater than the preset time threshold, indicating that the input voice is too long and may not be interactive input voice, deleting the input voice without performing voice recognition processing on the input voice;
Traversing all text combinations in a preset text library to obtain text combinations matched with the input text, wherein the text combinations are used for storing a plurality of interaction keywords with the same semantic meaning but different interaction languages, and each interaction keyword corresponds to a unique interaction language;
Obtaining a language type corresponding to a second terminal, and obtaining a corresponding interaction keyword from a text combination matched with the input text according to the language type corresponding to the second terminal to generate an output text;
Acquiring the fundamental tone frequency of the input voice, and judging whether the fundamental tone frequency is larger than a preset frequency threshold value; if the pitch frequency is larger than a preset frequency threshold, determining that the input voice is female voice input voice, generating female voice output voice according to the output text and the language type corresponding to the second terminal, and sending the female voice output voice to the second terminal; if the pitch frequency is not greater than a preset frequency threshold, determining that the input voice is male voice input voice, generating male voice output voice according to the output text and the language type corresponding to the second terminal, and sending the male voice output voice to the second terminal.
2. The method of claim 1, wherein traversing all text combinations in a pre-set text library to obtain text combinations that match the input text comprises:
word segmentation processing is carried out on the input text so as to generate text keywords;
Traversing all text combinations in a preset text library to obtain interaction keywords which are the same as the text keywords;
and determining the text combination where the interaction keyword identical to the text keyword is located as the text combination matched with the input text.
3. The method of claim 1, wherein the obtaining the language type corresponding to the second terminal and generating the output text according to the language type corresponding to the second terminal comprises:
Acquiring a system language type currently used by a second terminal, and taking the system language currently used by the second terminal as a language type corresponding to the second terminal;
Acquiring interactive languages with the same type as the second terminal language from a preset text library;
and generating an output text according to the interactive languages with the same type as the second terminal language and a preset text library.
4. A voice interaction device, the device comprising:
the first acquisition unit is used for automatically acquiring input voice through the first terminal and judging whether the duration of the input voice is greater than a preset time threshold; if the duration of the input voice is not greater than the preset time threshold, indicating that the input voice belongs to interactive input voice, performing voice recognition processing on the input voice to obtain an input text; if the duration of the input voice is greater than the preset time threshold, indicating that the input voice is too long and may not be interactive input voice, deleting the input voice without performing voice recognition processing on the input voice;
The second obtaining unit is used for traversing all text combinations in a preset text library to obtain text combinations matched with the input text, wherein the text combinations are used for storing a plurality of interaction keywords with the same semantics and different interaction languages, and each interaction keyword corresponds to a unique interaction language;
the third acquisition unit is used for acquiring the language type corresponding to the second terminal, and acquiring the corresponding interaction keyword from the text combination matched with the input text according to the language type corresponding to the second terminal to generate an output text;
The first generation unit is used for acquiring the fundamental tone frequency of the input voice and judging whether the fundamental tone frequency is larger than a preset frequency threshold value or not; if the pitch frequency is larger than a preset frequency threshold, determining that the input voice is female voice input voice, generating female voice output voice according to the output text and the language type corresponding to the second terminal, and sending the female voice output voice to the second terminal; if the pitch frequency is not greater than a preset frequency threshold, determining that the input voice is male voice input voice, generating male voice output voice according to the output text and the language type corresponding to the second terminal, and sending the male voice output voice to the second terminal.
5. The apparatus of claim 4, wherein the second acquisition unit comprises:
the second generation unit is used for carrying out word segmentation processing on the input text so as to generate text keywords;
A fourth obtaining unit, configured to traverse all text combinations in a preset text library, and obtain interaction keywords that are the same as the text keywords;
and the first determining unit is used for determining the text combination where the interaction keyword identical to the text keyword is located as the text combination matched with the input text.
6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the voice interaction method according to any of claims 1 to 3 when executing the program.
7. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the voice interaction method according to any of claims 1-3.
CN201811554298.5A 2018-12-19 2018-12-19 Voice interaction method, device, computer equipment and storage medium Active CN109658931B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811554298.5A CN109658931B (en) 2018-12-19 2018-12-19 Voice interaction method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811554298.5A CN109658931B (en) 2018-12-19 2018-12-19 Voice interaction method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109658931A (en) 2019-04-19
CN109658931B (en) 2024-05-10

Family

ID=66114910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811554298.5A Active CN109658931B (en) 2018-12-19 2018-12-19 Voice interaction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109658931B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502631B (en) * 2019-07-17 2022-11-04 招联消费金融有限公司 Input information response method and device, computer equipment and storage medium
CN110414014B (en) * 2019-08-05 2020-12-04 珠海格力电器股份有限公司 Voice equipment control method and device, storage medium and voice equipment
CN114402384A (en) * 2019-11-04 2022-04-26 深圳市欢太科技有限公司 Data processing method, device, server and storage medium
CN112767920A (en) * 2020-12-31 2021-05-07 深圳市珍爱捷云信息技术有限公司 Method, device, equipment and storage medium for recognizing call voice

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104394265A (en) * 2014-10-31 2015-03-04 小米科技有限责任公司 Automatic session method and device based on mobile intelligent terminal
CN104536978A (en) * 2014-12-05 2015-04-22 奇瑞汽车股份有限公司 Voice data identifying method and device
CN105511857A (en) * 2015-11-27 2016-04-20 小米科技有限责任公司 System language setting method and device
CN107610706A (en) * 2017-09-13 2018-01-19 百度在线网络技术(北京)有限公司 The processing method and processing unit of phonetic search result

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102108500B1 (en) * 2013-02-22 2020-05-08 삼성전자 주식회사 Supporting Method And System For communication Service, and Electronic Device supporting the same
US10318634B2 (en) * 2017-01-02 2019-06-11 International Business Machines Corporation Enhancing QA system cognition with improved lexical simplification using multilingual resources


Also Published As

Publication number Publication date
CN109658931A (en) 2019-04-19

Similar Documents

Publication Publication Date Title
CN109658931B (en) Voice interaction method, device, computer equipment and storage medium
US10936664B2 (en) Dialogue system and computer program therefor
US11037553B2 (en) Learning-type interactive device
CN106919661B (en) Emotion type identification method and related device
WO2021000408A1 (en) Interview scoring method and apparatus, and device and storage medium
CN113962315A (en) Model pre-training method, device, equipment, storage medium and program product
US11531693B2 (en) Information processing apparatus, method and non-transitory computer readable medium
CN109284502B (en) Text similarity calculation method and device, electronic equipment and storage medium
CN109215630B (en) Real-time voice recognition method, device, equipment and storage medium
JP5496863B2 (en) Emotion estimation apparatus, method, program, and recording medium
WO2021012649A1 (en) Method and device for expanding question and answer sample
Glasser Automatic speech recognition services: Deaf and hard-of-hearing usability
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
TWI749349B (en) Text restoration method, device, electronic equipment and computer readable storage medium
CN109582775B (en) Information input method, device, computer equipment and storage medium
CN109190116B (en) Semantic analysis method, system, electronic device and storage medium
WO2022022049A1 (en) Long difficult text sentence compression method and apparatus, computer device, and storage medium
CN110020429A (en) Method for recognizing semantics and equipment
CN113112992B (en) Voice recognition method and device, storage medium and server
CN111950267A (en) Method and device for extracting text triples, electronic equipment and storage medium
CN112307754A (en) Statement acquisition method and device
JP6097791B2 (en) Topic continuation desire determination device, method, and program
CN110147556B (en) Construction method of multidirectional neural network translation system
CN112307181A (en) Corpus-specific-corpus-based corpus extraction method and corpus extractor
CN112786041A (en) Voice processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant