CN111312212A - Voice processing method, device and medium


Info

Publication number
CN111312212A
CN111312212A (application CN202010117629.XA)
Authority
CN
China
Prior art keywords
translation result
voice signal
target language
microphone
microphone array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010117629.XA
Other languages
Chinese (zh)
Inventor
崔文华
张丹
宋金昌
罗大为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010117629.XA
Publication of CN111312212A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention provide a voice processing method, apparatus, and medium. The method specifically comprises the following steps: collecting a voice signal of a sound source through a microphone array; determining a target language corresponding to the voice signal according to spatial information of the sound source relative to the microphone array; determining a translation result of the voice signal in the target language; and outputting the translation result. Embodiments of the invention enable continuous cross-language dialogue without relying on manual trigger operations, thereby improving the efficiency and fluency of the conversation.

Description

Voice processing method, device and medium
Technical Field
The present invention relates to the field of electronic device technologies, and in particular, to a speech processing method, a speech processing apparatus, and a machine-readable medium.
Background
Voice is one of the most natural ways of communicating. As the number of people traveling abroad grows year by year, users increasingly need real-time speech translation in overseas scenarios. For example, users often need cross-language conversation during outbound travel, ranging from basic needs such as asking for directions or ordering food to deeper communication with foreigners.
Taking a cross-language conversation between user A and user B as an example: in current real-time speech translation methods, a translation terminal usually provides a key. User A presses the key to speak, and after user A finishes speaking the terminal plays translation result A, corresponding to user A's speech, for user B to listen to. User B then presses the key to speak, and after user B finishes the terminal plays translation result B, corresponding to user B's speech, for user A to listen to.
In such a cross-language conversation, the user usually presses a key to trigger each utterance, so the more utterances there are, the more manual trigger operations are needed. This hurts the efficiency and fluency of the conversation, which in turn hurts the user experience.
Disclosure of Invention
Embodiments of the present invention provide a speech processing method, a speech processing apparatus, an apparatus for speech processing, and a machine-readable medium that enable continuous cross-language dialogue without relying on manual trigger operations, thereby improving the efficiency and fluency of the conversation.
In order to solve the above problem, an embodiment of the present invention discloses a speech processing method, including:
collecting voice signals of a sound source through a microphone array;
determining a target language corresponding to the voice signal according to the spatial information of the sound source relative to the microphone array;
determining a translation result of the voice signal in the target language;
and outputting the translation result.
In another aspect, an embodiment of the present invention discloses a speech processing apparatus, including:
the acquisition module is used for acquiring a voice signal of a sound source through the microphone array;
the target language determining module is used for determining a target language corresponding to the voice signal according to the spatial information of the sound source relative to the microphone array;
the translation result determining module is used for determining the translation result of the voice signal corresponding to the target language; and
and the translation result output module is used for outputting the translation result.
In yet another aspect, an embodiment of the present invention discloses an apparatus for speech processing, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
collecting voice signals of a sound source through a microphone array;
determining a target language corresponding to the voice signal according to the spatial information of the sound source relative to the microphone array;
determining a translation result of the voice signal in the target language;
and outputting the translation result.
In yet another aspect, embodiments of the invention disclose one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform one or more of the speech processing methods described above.
The embodiment of the invention has the following advantages:
In embodiments of the invention, a sound source occupies a corresponding pickup angle in the microphone array, so the target language for a single utterance can be determined from the spatial information of the sound source relative to the array, and the translation result can be provided automatically after the utterance ends. Embodiments of the invention can therefore support continuous cross-language dialogue without relying on manual trigger operations, improving the efficiency and fluency of the conversation.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described here cover only some embodiments of the invention, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an application scenario according to an embodiment of the present invention;
FIG. 2 is a flow chart of the steps of one embodiment of a speech processing method of the present invention;
FIG. 3 is a schematic of an interface according to an embodiment of the present invention;
FIG. 4(a) is a schematic diagram of the interface in the initial state according to an embodiment of the present invention;
FIG. 4(b) is a schematic diagram of the interface in the sound pickup state according to an embodiment of the present invention;
FIG. 4(c) is a schematic diagram of the interface in the translation result playback state according to an embodiment of the present invention;
FIG. 5 is a block diagram of another embodiment of a speech processing apparatus according to the present invention;
FIG. 6 is a block diagram of an apparatus 900 for speech processing of the present invention; and
FIG. 7 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a voice processing scheme that enables continuous cross-language conversation without relying on manual trigger operations, thereby improving the efficiency and fluency of the conversation.
The scheme specifically comprises the following steps: collecting a voice signal of a sound source through a microphone array; determining a target language corresponding to the voice signal according to spatial information of the sound source relative to the microphone array; determining a translation result of the voice signal in the target language; and outputting the translation result.
A microphone array usually consists of a number of acoustic sensors (typically microphone elements) and is used to sample and process the spatial characteristics of the sound field.
A microphone array is a set of microphone elements distributed in space in a certain arrangement so as to better acquire the spatial information of a sound source. The number of microphone elements is not less than 2; for example, it may be 2, 4, 6, or 8. More elements increase the sound pickup range.
Optionally, the microphone array in embodiments of the invention may be a nonlinear array, which increases the pickup range: a nonlinear array can achieve a pickup angle exceeding 180 degrees, i.e., the pickup angle in the embodiments of the present application may range from 180 to 360 degrees. Here, sound pickup refers to the process of collecting sound, and the pickup angle refers to the angular range over which sound is collected.
In embodiments of the invention, a sound source occupies a corresponding pickup angle in the microphone array, so the target language for a single utterance can be determined from the spatial information of the sound source relative to the array, and the translation result can be provided automatically after the utterance ends. Embodiments of the invention can therefore support continuous cross-language dialogue without relying on manual trigger operations, improving the efficiency and fluency of the conversation.
In addition, by sparing the user manual trigger operations, embodiments of the invention help the user focus on the conversation itself, improving the naturalness of the dialogue.
The embodiment of the invention can be applied to the scenes of cross-language conversation such as personal communication, interview, business communication and the like.
Embodiments of the invention collect the voice signal of a sound source through the microphone array. Note that the array may first collect a raw sound signal and perform VAD (Voice Activity Detection) on it to obtain the voice signal of the sound source.
The purpose of VAD is to identify and eliminate long periods of silence from the sound signal. In embodiments of the invention, the VAD result is either a speech signal, which is processed further, or a non-speech signal, which may be discarded. In other words, VAD determines the start point and the end point of the speech signal.
VAD can accurately separate valid speech signals from invalid non-speech signals (e.g., silence and/or noise) under stationary or non-stationary noise. Optionally, when a silent period exceeds a preset length, the voice signal is considered to have paused, and its end point can be determined.
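As an illustration of the endpoint logic just described, the following is a minimal energy-based VAD sketch in Python. The patent does not specify a detection algorithm; the frame length, energy threshold, and silence limit below are assumptions for illustration only.

```python
import numpy as np

FRAME_MS = 20             # frame length in milliseconds (assumed)
ENERGY_THRESHOLD = 1e-4   # mean-square energy above this marks speech (assumed)
MAX_SILENCE_MS = 500      # the "preset period" of silence that ends an utterance (assumed)

def find_utterance(samples: np.ndarray, sample_rate: int):
    """Return (start, end) sample indices of the first utterance, or None."""
    frame_len = int(sample_rate * FRAME_MS / 1000)
    max_silent_frames = MAX_SILENCE_MS // FRAME_MS
    start = None
    silent_run = 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len].astype(np.float64)
        is_speech = np.mean(frame ** 2) > ENERGY_THRESHOLD
        if is_speech:
            if start is None:
                start = i          # start point of the voice signal
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return start, i    # silence exceeded the preset period: end point
    return (start, len(samples)) if start is not None else None
```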
Referring to fig. 1, an example application scenario of an embodiment of the speech processing method of the present invention is shown: a cross-language dialog between user A and user B, where user A uses a first language and user B uses a second language.
User A and user B may use the translation terminal 100 to conduct a cross-language conversation. The translation terminal 100 may have an internal or external microphone array, which may include at least one first microphone element 101 and at least one second microphone element 102. In fig. 1, the schematic layouts of the first microphone element 101 and the second microphone element 102 in the translation terminal 100 are shown with dashed lines.
The layout of the first microphone element 101 and the second microphone element 102 may be determined by a person skilled in the art according to the actual application requirements.
According to an embodiment, the first microphone element 101 and the second microphone element 102 may be arranged opposite to each other. For example, the first microphone element 101 is located at the top of the translation terminal 100, the second microphone element 102 is located at the bottom of the translation terminal 100, and so on.
According to another embodiment, the first microphone element 101 may be located at the top of the translation terminal 100, and the second microphone element 102 may be located at the side of the translation terminal 100, for example, the two sides of the translation terminal 100 are respectively provided with the second microphone elements 102.
The number and properties of the first microphone element 101 and the second microphone element 102 may be determined by a person skilled in the art according to the actual application requirements.
As in fig. 1, two first microphone elements 101 are located on top of the translation terminal 100. The first microphone element 101 may be a directional microphone. The directional microphone can pick up sound towards the pointed direction and shield the sound in other directions. For example, the directional microphone is pointed to the user a, and the directional microphone can collect the voice of the user a to improve the accuracy of voice recognition.
As in fig. 1, three second microphone elements 102 are located on each of the two sides of the translation terminal 100. The second microphone element 102 may be an omnidirectional microphone, which picks up sound from all directions, i.e., over a full 360 degrees.
It is to be understood that the layout and properties of the first microphone element 101 and the second microphone element 102 in fig. 1 are only used as an alternative embodiment, and in fact, the layout and properties of the first microphone element 101 and the second microphone element 102 are not limited by the embodiment of the present invention, for example, the first microphone element 101 and the second microphone element 102 are both directional microphones or omnidirectional microphones, etc.
The translation terminal 100 may be disposed between user A and user B, for example with the first microphone element 101 directed toward user A and the second microphone element 102 directed toward user B; in this case, the first microphone element 101 may collect the speech signal of user A and the second microphone element 102 may collect the speech signal of user B. Of course, the first microphone element 101 may instead be directed toward user B and the second microphone element 102 toward user A.
In the conventional art, different users share a microphone array; specifically, they use it in turn, and a user manually triggers each utterance and specifies its target language.
In embodiments of the invention, a sound source occupies a corresponding pickup angle in the microphone array, so the target language for a single utterance can be determined from the spatial information of the sound source relative to the array. In other words, embodiments can determine which user spoke and into which target language the speech should be translated, saving the user manual trigger operations.
The embodiment of the invention can continuously collect the voice signals of the sound source through the microphone array. For any speech of a cross-language conversation, the embodiment of the invention may determine a target language corresponding to the speech signal according to spatial information of the sound source relative to the microphone array, determine a translation result of the speech signal corresponding to the target language, and output the translation result after the speech signal is ended.
After the translation result is output, the embodiment of the present invention may continue to detect the next utterance, and specifically, may continue to determine the target language corresponding to the next utterance. Therefore, the embodiment of the invention can realize continuous cross-language dialogue under the condition of saving manual operation of a user.
The voice processing method provided by the embodiment of the invention can be applied to application environments corresponding to the client and the server, wherein the client and the server are positioned in a wired or wireless network, and the client and the server perform data interaction through the wired or wireless network.
Optionally, the client may run on a terminal, which specifically includes but is not limited to: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, translation terminals, and the like. The client may correspond to any application program, such as a speech translation program.
Method embodiment
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a speech processing method according to the present invention is shown, which may specifically include:
step 201, collecting a voice signal of a sound source through a microphone array;
step 202, determining a target language corresponding to the voice signal according to the spatial information of the sound source relative to the microphone array;
step 203, determining a translation result of the voice signal corresponding to the target language;
and step 204, outputting the translation result.
The method embodiment shown in fig. 2 may be executed by a client or by a server; the embodiment of the present invention does not limit the specific execution subject.
In step 201, a microphone array may be coupled to a terminal.
The coupling may specifically include a contact or non-contact connection between the microphone array and the terminal. A contact connection may be a data-line connection; for example, the microphone array may be connected to the mobile device via a pluggable USB (Universal Serial Bus) interface. Alternatively, the microphone array may be a component of the terminal, i.e., integrated inside it. A non-contact connection may include a Wi-Fi (Wireless Fidelity) connection, a Bluetooth connection, etc. The embodiment of the present invention does not limit the specific connection between the microphone array and the terminal.
The microphone elements of the array may be arranged in various ways, such as a uniform circular array, a uniform polygonal array, a non-uniform circular array, or a non-uniform polygonal array. The embodiment of the invention does not limit the specific arrangement rule of the microphone array.
According to an embodiment, the microphone array may be a uniform array in which the distances between microphone elements are the same. According to another embodiment, it may be a non-uniform array in which the distances between elements differ.
In an optional embodiment of the present application, the microphone array may include N microphone elements, N being an even number greater than 2. An even N gives the array symmetry, which facilitates processing of the voice signal.
In another alternative embodiment of the present application, the microphone array includes N microphone elements located around a predetermined central point or a predetermined center line, such that the N microphone elements are arranged in a closed pattern, such as a circle, an ellipse, a polygon, and the like.
In this embodiment of the present invention, optionally, the arrangement information of the microphone array may be determined according to the number P of supported sound sources (P is a natural number greater than 1).
Assuming that the number of supported sound sources is 2, microphone arrays may be provided at opposite sides of the terminal. Assuming that the number of supported sound sources is 3, microphone arrays may be provided on three sides of the terminal. Assuming that the number of supported sound sources is 4, microphone arrays may be disposed on four sides of the terminal. Of course, the number of sound sources in the embodiment of the present invention may be greater than 4, and for example, a microphone array corresponding to two sound sources may be disposed on one side.
In step 202, the spatial information of the sound source relative to the microphone array may be obtained by a sound source direction-finding method and expressed by features such as azimuth angle, pitch angle, and distance. Optionally, the direction-finding method may be based on TDOA (Time Difference of Arrival). Its principle is to estimate the time delays from the voice signal to the different microphone elements using time-delay estimation methods such as generalized cross-correlation, and then to estimate the spatial information of the sound source from these delays and the spatial distribution of the array. Of course, the TDOA-based method is only an example; those skilled in the art may also use other direction-finding methods, such as steered beamforming based on maximum output power, according to practical requirements. The embodiment of the present application does not limit the specific direction-finding method.
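To make the TDOA principle concrete, here is a hedged Python sketch of time-delay estimation with GCC-PHAT, one member of the generalized cross-correlation family mentioned above, for a single pair of microphone elements. The sampling rate, microphone spacing, and far-field assumption are illustrative choices, not parameters from the patent.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at room temperature

def gcc_phat_delay(sig_a: np.ndarray, sig_b: np.ndarray, sample_rate: int) -> float:
    """Estimate the arrival-time delay (in seconds) of sig_b relative to sig_a."""
    n = len(sig_a) + len(sig_b)                 # zero-pad to avoid circular wrap
    spec_a = np.fft.rfft(sig_a, n=n)
    spec_b = np.fft.rfft(sig_b, n=n)
    cross = spec_a * np.conj(spec_b)
    cross /= np.abs(cross) + 1e-12              # PHAT weighting: keep phase only
    corr = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    corr = np.concatenate((corr[-max_shift:], corr[:max_shift + 1]))
    shift = np.argmax(np.abs(corr)) - max_shift # lag (in samples) of the peak
    return shift / sample_rate

def azimuth_from_delay(delay: float, mic_distance: float) -> float:
    """Map a delay to an arrival angle (degrees) for one mic pair, far field."""
    cos_theta = np.clip(delay * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```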
Spatial information of a sound source relative to the microphone array can be used for determining a target language, and therefore output of a translation result can be achieved.
In an embodiment of the present invention, optionally, the determining the target language corresponding to the voice signal may specifically include: and searching a mapping relation between the spatial information and a target language according to the spatial information of the sound source relative to the microphone array so as to obtain the target language corresponding to the voice signal.
The spatial information may include the identifier of a microphone element, such as a numerical, color, or orientation identifier; the identifier may be shown on the surface of the terminal or on the interface.
Alternatively, the spatial information may be associated with the microphone elements. For example, a first microphone element corresponds to pickup angle range 1, a second to pickup angle range 2, a third to pickup angle range 3, and so on. The microphone element identifier corresponding to the sound source can then be determined from the pickup angle range that covers the spatial information.
The mapping relationship between spatial information and target language can be obtained from user settings. For example, a settings interface for associating spatial information with a target language may be provided: when a participant of a cross-language dialog intends to use certain microphone elements, the mapping between those elements' identifiers and the target language can be set through this interface. For instance, a user whose native language is Chinese can set the target language to English, Japanese, etc.
For example, if the terminal has a first microphone element and a second microphone element on two opposite sides, it can be determined during sound pickup which target language either of the two elements corresponds to, e.g., that the target language corresponding to the first microphone element is English.
It should be noted that in a cross-language dialog scenario with two participating languages, the parties may include a first participant and a second participant, and the language of one participant may be the target language of the other. For example, the first participant speaks Chinese with English as the target language, while the second participant speaks English with Chinese as the target language. Embodiments of the invention may therefore determine a target language for either of the first and second microphone elements, or determine corresponding target languages for both.
Alternatively, the mapping relationship between the spatial information and the target language may be obtained according to the location information or language usage information of the user.
Embodiments of the invention may determine the target language in the mapping relationship according to the language-usage information of the local user: the foreign language used by the local user can be inferred from this information, and the target language determined from that foreign language.
Alternatively, embodiments of the invention may determine the target language in the mapping relationship according to the location information of the local user. For example, when the location information indicates that the local user is abroad, the target language may be derived from the language used at that location. Whether the local user is abroad can be judged by matching the location information against a preset location range: for example, if the preset range is China and the user has traveled to Europe, America, Australia, etc., the language of the user's current location may be used as the target language.
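A minimal sketch of the lookup described in this subsection, assuming two microphone elements with fixed pickup angle ranges; the ranges, element identifiers, and language codes are invented for illustration and could equally be filled in from the settings interface or from location information.

```python
# Pickup angle ranges (degrees) assumed for two elements of the array.
PICKUP_RANGES = {
    "mic_top": (315, 45),      # range facing user A (wraps around 0)
    "mic_bottom": (135, 225),  # range facing user B
}

# Mapping set via the settings interface, or derived from location /
# language-usage information of the local user.
TARGET_LANGUAGE = {
    "mic_top": "en",     # user A speaks Chinese -> translate into English
    "mic_bottom": "zh",  # user B speaks English -> translate into Chinese
}

def element_for_azimuth(azimuth: float) -> str:
    """Find the microphone element whose pickup range covers the azimuth."""
    azimuth %= 360
    for element, (lo, hi) in PICKUP_RANGES.items():
        inside = lo <= azimuth <= hi if lo <= hi else (azimuth >= lo or azimuth <= hi)
        if inside:
            return element
    raise ValueError("azimuth outside every pickup range")

def target_language(azimuth: float) -> str:
    """Look up the target language from the sound source's spatial information."""
    return TARGET_LANGUAGE[element_for_azimuth(azimuth)]
```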
In step 203, the speech signal may first be converted into text, and the text then translated into the target language, yielding the translation result of the speech signal in the target language.
The embodiment of the application can adopt speech recognition technology to convert the speech signal into text. Denote the speech signal by S. A series of processing steps on S yields the corresponding speech feature sequence O = {O_1, O_2, …, O_i, …, O_T}, where O_i is the i-th speech feature and T is the total number of speech features. The sentence corresponding to the speech signal S can be regarded as a word string composed of many words, denoted W = {w_1, w_2, …, w_n}. The process of speech recognition is to find the most likely word string W given the known speech feature sequence O.
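In the standard formulation (a textbook identity, not notation taken from the patent itself), this search can be written as a Bayes decision rule, where P(O|W) is supplied by the acoustic model and P(W) by the language model:

```latex
W^{*} = \arg\max_{W} P(W \mid O)
      = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
      = \arg\max_{W} P(O \mid W)\,P(W)
```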
Specifically, speech recognition is a model-matching process. A speech model is first established according to the characteristics of human speech, and the templates required for recognition are built by analyzing input speech signals and extracting the needed features. Recognizing the user's speech input then amounts to comparing its features against the templates and selecting the best-matching template to obtain the recognition result. The specific speech recognition algorithm may be training and recognition based on statistical hidden Markov models, or other approaches such as neural-network-based training and recognition or recognition based on dynamic time warping.
The embodiment of the present invention may translate the text into the translation result of the target language by using a machine translation technology, and the embodiment of the present invention does not limit the specific machine translation technology.
It should be noted that the embodiment of the present invention does not limit the specific moment at which the translation result is determined. For example, the translation result in the target language may be determined once the length of the speech signal exceeds a first length threshold; alternatively, it may be determined after the speech signal has ended.
In step 204, outputting the translation result may include displaying the translation result, for example to the user of the target language. Optionally, the translation result may be displayed continuously while the speech is in progress.
Alternatively, outputting the translation result may include playing the translation result after the voice signal has ended, for the user to listen to.
The end of the voice signal marks the end of one utterance; outputting the translation result only after the utterance avoids the translation result interfering with the speaker.
In this embodiment of the present invention, optionally, the end of the speech signal may be determined as follows:
performing voice activity detection on the voice signal; and after the end point of the voice signal is detected, judging that the voice signal has ended.
For example, during voice activity detection, the length of the silent period may be measured; if it exceeds a second length threshold, the end point of the voice signal is considered detected. Of course, the embodiment of the present invention does not limit the specific manner of determining the end of the speech signal.
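Putting steps 201-204 together, the continuous behavior described above amounts to the following loop. This is a hedged sketch: all five callables are injected stand-ins for the array capture (including VAD endpointing), the spatial-information lookup, ASR, machine translation, and playback; none of them are APIs named by the patent.

```python
def dialogue_loop(next_utterance, pick_language, recognize, translate, play):
    """Continuous cross-language dialogue loop covering steps 201-204.

    Every callable is injected: next_utterance captures one VAD-endpointed
    voice signal plus its azimuth from the microphone array; pick_language
    maps spatial information to the target language; recognize, translate,
    and play stand in for the ASR, machine translation, and audio output
    components.
    """
    while True:
        signal, azimuth = next_utterance()           # step 201 (+ endpoint detection)
        target_lang = pick_language(azimuth)         # step 202
        text = recognize(signal)                     # step 203a: speech -> text
        translation = translate(text, target_lang)   # step 203b: text -> target language
        play(translation)                            # step 204: output, then keep listening
```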
In embodiments of the invention, at a given moment the voice signal may be collected by only some of the microphone elements in the array, i.e., some elements are in the working state while others are idle. Embodiments of the invention can therefore close the idle elements to reduce the power consumption of the microphone array.
In an alternative embodiment of the present invention, the microphone array may include a first microphone element and a second microphone element, the first microphone element matching the voice signal, i.e., being the element that collects the voice signal of the currently speaking sound source;
in this case, the method may further include: switching off the second microphone element. The second microphone element here generally refers to an element in the idle state; closing it reduces its power consumption as well as the processing resources that would be spent on the signals it collects.
In another optional embodiment of the present invention, the method may further comprise: turning the second microphone element back on after the translation result is output, so that subsequent speech signals can be collected through it.
It should be noted that before the target microphone element for an utterance is determined, both the first microphone element and the second microphone element are on. The target microphone element is the element used to collect the voice signal of the currently speaking sound source.
Once the target microphone element for an utterance is determined, the target element stays on and the non-target elements are closed. For example, if the target is the first microphone element, the non-target element may be the second microphone element.
After the utterance ends, or after its translation result is output, the non-target elements can be turned back on, so that both the first and second microphone elements are on and ready to collect the voice signal of the next utterance.
For example, user A and user B have a cross-language conversation using the translation terminal 100, which is placed between them with the first microphone element 101 directed toward user A and the second microphone element 102 directed toward user B. During an utterance of user A, it can be determined that the first microphone element 101 is working and the second microphone element 102 is idle, so the second element can be turned off. After the translation result corresponding to user A's speech is output, the second microphone element 102 can be turned on again to collect user B's speech signal.
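The on/off behavior described in this subsection can be summarized as a small state machine. The sketch below is illustrative only; the element names and controller interface are assumptions, not part of the patent.

```python
class MicController:
    """Tracks which microphone elements are open, per the behavior above."""

    def __init__(self, elements):
        # Before the target element of an utterance is determined,
        # all elements are in the open state.
        self.state = {e: True for e in elements}

    def on_utterance_start(self, target):
        # Keep only the target element open; close idle elements
        # to reduce power consumption.
        for e in self.state:
            self.state[e] = (e == target)

    def on_translation_output(self):
        # After the translation result is output, reopen every element
        # so the next utterance can come from any direction.
        for e in self.state:
            self.state[e] = True

mics = MicController(["mic_A", "mic_B"])
mics.on_utterance_start("mic_A")   # user A is speaking: mic_B is closed
mics.on_translation_output()       # both elements ready for the next speaker
```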
In another optional embodiment of the present invention, the method may further comprise: displaying an interface that includes at least two oppositely arranged regions, where text in different regions corresponds to different languages, and the region matching the voice signal is displayed with a mark. The entire matching region may be marked, prompting users as to which participant's speech is currently being picked up.
Embodiments of the invention can set multiple areas in the interface, with different areas displaying text in different languages for users of those languages to view.
Optionally, the at least two regions comprise: a first region and a second region; the first region is matched with the voice signal;
the first region displays a source language text, and the second region displays a translation result of the source language text corresponding to a target language. The source language may refer to a language to which the speech signal corresponds.
For example, when two languages participate, the interface may include a first area and a second area arranged one above the other, with the first area corresponding to the first microphone element 101 and displaying Chinese text, the second area corresponding to the second microphone element 102 and displaying English text, and so on.
Optionally, corresponding marks may be displayed in the regions, with different regions carrying different marks to distinguish the parties. The mark display in embodiments of the invention may include color display, corner-mark display, text display, etc.
Referring to fig. 3, a schematic diagram of an interface according to an embodiment of the present invention is shown, where the interface may specifically include: a first region 301 and a second region 302.
The first region 301 may include a source language setting area 311; in fig. 3 the source language is Chinese. The source language setting area 311 may provide a drop-down list of languages for the user to select. The first region 301 may also include a first edge area 312, whose color may be a first color to indicate the first region 301.
The second region 302 may include a target language setting area 321; the target language shown in fig. 3 is English, meaning that when the user in the first area speaks Chinese as the source language, the target language of the translation result is English. The target language setting area 321 may provide a drop-down list of languages for the user to select. The second region 302 may also include a second edge area 322, whose color may be a second color to indicate the second region 302.
The text displayed in the first area 301 and the second area 302 may change as the processing state changes. For example, the processing state may include the initial state, the sound pickup state, the translation result playback state, and so on. The first text displayed in the first area 301 and the second text displayed in the second area 302 face opposite directions, so that the two parties can each read their own side.
Referring to fig. 4(a), which shows the interface in the initial state according to an embodiment of the present invention, the first text of the first area 301 may be "Say Chinese…" and the second text of the second area 302 may be "Speak English…".
Referring to fig. 4(b), which shows the interface in the sound pickup state, the entire first area 301 may be marked for display. The first text of the first area 301 may be the partially transcribed "the weather today" and the second text of the second area 302 the partial translation "What's the weather".
Referring to fig. 4(c), which shows the interface in the translation result playback state, the first text of the first area 301 may be "How is the weather today?" and the second text of the second area 302 "What's the weather today?". The second area 302 may bold or highlight the second text and display a play control 323 for it. In this state, the second text and the first text may use different display parameters, which may include display brightness, font, size, bolding, etc.
It can be understood that after the translation result has been played, the second text can return to normal display; in that case the second text and the first text may use the same display parameters.
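As a rough model of the interface behavior across the three states of fig. 4, here is a hedged Python sketch; the field names, colors, and prompt strings are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    language: str
    edge_color: str
    text: str = ""
    highlighted: bool = False   # marked display while picking up / playing

@dataclass
class Interface:
    first: Region = field(default_factory=lambda: Region("zh", "blue", "Say Chinese..."))
    second: Region = field(default_factory=lambda: Region("en", "orange", "Speak English..."))

    def on_pickup(self, source_text: str):
        # fig. 4(b): the first region is marked while sound is picked up
        self.first.text, self.first.highlighted = source_text, True

    def on_play_translation(self, translated: str):
        # fig. 4(c): the second region shows and highlights the translation
        self.first.highlighted = False
        self.second.text, self.second.highlighted = translated, True

    def on_play_done(self):
        # after playback, both regions return to the same display parameters
        self.second.highlighted = False

ui = Interface()
ui.on_pickup("How is the weather today?")            # source text (Chinese in fig. 4)
ui.on_play_translation("What's the weather today?")
ui.on_play_done()
```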
In summary, with the voice processing method of the embodiments of the present application, a sound source occupies a corresponding pickup angle in the microphone array, so the target language of a single utterance can be determined from the spatial information of the sound source relative to the array, and the translation result can be provided automatically after the utterance ends. Embodiments of the invention thus enable continuous cross-language dialogue without relying on manual trigger operations, improving the efficiency and fluency of the conversation.
In addition, by sparing the user manual trigger operations, embodiments of the invention help the user focus on the conversation itself, improving the naturalness of the dialogue.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because some steps may be performed in other orders or simultaneously. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily all required by the invention.
Device embodiment
Referring to fig. 5, a block diagram of a speech processing apparatus according to an embodiment of the present invention is shown, which may specifically include:
the acquisition module 501 is configured to acquire a voice signal of a sound source through a microphone array;
a target language determining module 502, configured to determine a target language corresponding to the voice signal according to spatial information of the sound source relative to the microphone array;
a translation result determining module 503, configured to determine the translation result of the speech signal in the target language;
a translation result output module 504, configured to output the translation result.
Optionally, the target language determination module 502 may include:
and the searching module is used for searching the mapping relation between the spatial information and the target language according to the spatial information of the sound source relative to the microphone array so as to obtain the target language corresponding to the voice signal.
Optionally, the translation result output module 504 is specifically configured to output the translation result after the speech signal is ended;
the above apparatus may further include:
the voice activity detection module is used for performing voice activity detection on the voice signal;
and the judging module is used for judging the end of the voice signal after detecting the end point of the voice signal.
Optionally, the microphone array may include: a first microphone element and a second microphone element; the first microphone element is matched with the voice signal;
the above apparatus may further include:
and the closing module is used for closing the second microphone element.
Optionally, the apparatus may further include:
and the opening module is used for opening the second microphone element after the translation result is output.
Optionally, the apparatus may further include:
the first display module is used for displaying an interface; the interface may include: at least two regions oppositely arranged; the texts in different areas correspond to different languages;
and the second display module is used for marking and displaying the area matched with the voice signal in the interface.
Optionally, the at least two regions may include: a first region and a second region; the first region is matched with the voice signal;
the first region displays a source language text, and the second region displays a translation result of the source language text corresponding to a target language.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention also provides an apparatus for speech processing, comprising a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: collecting a voice signal of a sound source through a microphone array; determining a target language corresponding to the voice signal according to spatial information of the sound source relative to the microphone array; determining a translation result of the voice signal in the target language; and outputting the translation result.
Fig. 6 is a block diagram illustrating a structure of an apparatus 900 for speech processing as a terminal according to an exemplary embodiment. For example, the apparatus 900 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, apparatus 900 may include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
The processing component 902 generally controls overall operation of the device 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operation at the device 900. Examples of such data include instructions for any application or method operating on device 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 906 provides power to the various components of the device 900. The power components 906 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 900.
The multimedia component 908 comprises a screen providing an output interface between the device 900 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when apparatus 900 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing status assessments of various aspects of the apparatus 900. For example, the sensor component 914 may detect the open/closed state of the device 900 and the relative positioning of components such as the display and keypad of the apparatus 900; it may also detect a change in the position of the apparatus 900 or of one of its components, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and changes in its temperature. The sensor component 914 may include a proximity sensor configured to detect nearby objects without any physical contact. It may also include a light sensor, such as a CMOS or CCD image sensor, for imaging applications. In some embodiments, the sensor component 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate wired or wireless communication between the apparatus 900 and other devices. The apparatus 900 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 904 comprising instructions, executable by the processor 920 of the apparatus 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 7 is a schematic diagram of a server in some embodiments of the invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer-readable storage medium, whose instructions, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform a speech processing method, the method comprising: collecting a voice signal of a sound source through a microphone array; determining a target language corresponding to the voice signal according to spatial information of the sound source relative to the microphone array; determining a translation result of the voice signal in the target language; and outputting the translation result.
The embodiment of the invention discloses A1, a voice processing method, the method comprising:
collecting voice signals of a sound source through a microphone array;
determining a target language corresponding to the voice signal according to the spatial information of the sound source relative to the microphone array;
determining a translation result of the voice signal in the target language;
and outputting the translation result.
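To make the disclosed flow concrete, the following is a minimal Python sketch of the four steps. It is illustrative only: capture_utterance, determine_target_language, recognize, and translate are hypothetical placeholder names standing in for the device's actual capture, lookup, ASR, and MT components, none of which are named by the disclosure.

    def capture_utterance():
        """Stand-in for microphone-array capture: returns audio plus the
        estimated direction of arrival (DOA) of the sound source."""
        return b"<pcm audio>", 45.0  # dummy audio, dummy DOA in degrees

    def determine_target_language(doa_degrees):
        """Stand-in for the spatial-information lookup detailed under A2."""
        return ("zh", "en") if doa_degrees % 360 < 180 else ("en", "zh")

    def recognize(audio, language):
        return "<recognized text>"  # placeholder ASR

    def translate(text, source, target):
        return f"<{text} translated {source}->{target}>"  # placeholder MT

    def process_utterance():
        audio, doa = capture_utterance()                  # step 1: collect via the array
        source, target = determine_target_language(doa)   # step 2: spatial info -> languages
        text = recognize(audio, language=source)
        result = translate(text, source=source, target=target)  # step 3
        print(result)                                     # step 4: output the result
        return result

    process_utterance()

In this reading, the language decision is made from the direction of arrival before any recognition runs, which is what lets the device dispense with a manual language toggle between turns.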
A2. The method of A1, wherein determining the target language corresponding to the voice signal comprises:
searching a mapping relation between spatial information and target languages according to the spatial information of the sound source relative to the microphone array, so as to obtain the target language corresponding to the voice signal (see the sketch below).
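One plausible realization of the mapping relation is a table keyed by angular sectors around the array, each sector bound to a (source, target) language pair. The sector boundaries and language assignments below are illustrative assumptions, not values from the disclosure.

    # Hypothetical mapping relation: DOA sector -> (source, target) languages.
    SECTOR_LANGUAGE_MAP = {
        (0, 180):   ("zh", "en"),  # e.g., the owner's side of the device
        (180, 360): ("en", "zh"),  # e.g., the interlocutor's side
    }

    def determine_target_language(doa_degrees):
        """Look up the language pair for a direction of arrival in degrees."""
        doa = doa_degrees % 360
        for (low, high), languages in SECTOR_LANGUAGE_MAP.items():
            if low <= doa < high:
                return languages
        raise ValueError(f"no sector covers DOA {doa_degrees}")

    print(determine_target_language(45.0))   # ('zh', 'en')
    print(determine_target_language(270.0))  # ('en', 'zh')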
A3. The method of A1, wherein outputting the translation result comprises: outputting the translation result after the voice signal has ended;
the method further comprises the following steps:
performing voice activity detection on the voice signal;
and upon detecting the endpoint of the voice signal, determining that the voice signal has ended (an endpointing sketch follows).
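A common way to implement such endpoint detection is frame-level energy gating with a silence hangover: once speech has been observed, a sustained run of low-energy frames marks the endpoint, after which the translation can be output. The frame size, RMS threshold, and hangover count below are illustrative assumptions rather than parameters from the disclosure.

    import struct

    FRAME_BYTES = 640      # 20 ms of 16-bit mono PCM at 16 kHz (assumed)
    RMS_THRESHOLD = 500.0  # illustrative energy gate for "speech present"
    HANGOVER_FRAMES = 25   # ~0.5 s of trailing silence marks the endpoint

    def frame_rms(frame):
        """Root-mean-square amplitude of one 16-bit little-endian frame."""
        samples = struct.unpack(f"<{len(frame) // 2}h", frame)
        return (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5

    def endpoint_detected(frames):
        """True once enough consecutive silent frames follow detected speech."""
        heard_speech, silent_run = False, 0
        for frame in frames:
            if frame_rms(frame) >= RMS_THRESHOLD:
                heard_speech, silent_run = True, 0
            else:
                silent_run += 1
            if heard_speech and silent_run >= HANGOVER_FRAMES:
                return True  # voice signal has ended; output the translation
        return False

    # Example: loud frames followed by a stretch of silence.
    loud = struct.pack("<320h", *([3000] * 320))
    quiet = bytes(FRAME_BYTES)
    print(endpoint_detected([loud] * 10 + [quiet] * 30))  # True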
A4. The method of any one of A1 to A3, wherein the microphone array comprises: a first microphone element and a second microphone element, the first microphone element being matched with the voice signal;
the method further comprises:
switching off the second microphone element.
A5. The method of A4, further comprising:
turning the second microphone element back on after outputting the translation result (see the gating sketch below).
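Read together, A4 and A5 amount to gating the non-matching microphone element for the duration of a turn, for instance so that the opposite element does not re-capture the device's own playback of the translation. MicElement and its switch_on/switch_off methods below are a hypothetical stand-in for whatever driver interface the device actually exposes.

    # Hypothetical microphone-element gating for one conversational turn.
    class MicElement:
        def __init__(self, name):
            self.name, self.enabled = name, True

        def switch_off(self):
            self.enabled = False  # A4: stop capturing on this element

        def switch_on(self):
            self.enabled = True   # A5: resume capturing

    def handle_turn(matched, other, output_translation):
        other.switch_off()        # gate the non-matching element
        try:
            output_translation()  # recognize/translate/play the utterance
        finally:
            other.switch_on()     # re-enable once the result has been output

    first, second = MicElement("first"), MicElement("second")
    handle_turn(first, second, lambda: print("<play translation>"))
    print(second.enabled)  # True: restored after the translation is output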
A6. The method of any one of A1 to A3, further comprising:
displaying an interface, the interface comprising at least two oppositely arranged regions, wherein text in different regions corresponds to different languages;
and highlighting, in the interface, the region matched with the voice signal.
A7. The method of A6, wherein the at least two regions comprise: a first region and a second region, the first region being matched with the voice signal;
wherein the first region displays source-language text, and the second region displays the translation result of that text in the target language (a display-state sketch follows).
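The interface state of A6/A7 can be modeled as two region records whose highlight flag follows the matched sound source: the source text lands in the matched region and the translation in the opposite one. The Region structure and field names below are illustrative, and actual rendering would go through the device's UI toolkit.

    from dataclasses import dataclass

    @dataclass
    class Region:
        language: str
        text: str = ""
        highlighted: bool = False  # set when matched with the voice signal

    def on_utterance(regions, matched, source_text, translation):
        """Update the two oppositely arranged regions for one utterance."""
        for i, region in enumerate(regions):
            region.highlighted = (i == matched)  # mark the matched region
        regions[matched].text = source_text      # matched region: source text
        regions[1 - matched].text = translation  # opposite region: translation

    regions = [Region("zh"), Region("en")]
    on_utterance(regions, matched=0, source_text="你好", translation="Hello")
    print(regions)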
Embodiment B8 of the invention discloses a voice processing apparatus, the apparatus comprising:
an acquisition module, configured to collect a voice signal of a sound source through a microphone array;
a target language determining module, configured to determine a target language corresponding to the voice signal according to spatial information of the sound source relative to the microphone array;
a translation result determining module, configured to determine a translation result of the voice signal in the target language; and
a translation result output module, configured to output the translation result.
B9. The apparatus of B8, wherein the target language determining module comprises:
a searching module, configured to search a mapping relation between spatial information and target languages according to the spatial information of the sound source relative to the microphone array, so as to obtain the target language corresponding to the voice signal.
B10. The apparatus of B8, wherein the translation result output module is specifically configured to output the translation result after the voice signal has ended;
the apparatus further comprising:
a voice activity detection module, configured to perform voice activity detection on the voice signal;
and a determining module, configured to determine that the voice signal has ended upon detecting the endpoint of the voice signal.
B11. The apparatus of any one of B8 to B10, wherein the microphone array comprises: a first microphone element and a second microphone element, the first microphone element being matched with the voice signal;
the apparatus further comprises:
a switching-off module, configured to switch off the second microphone element.
B12. The apparatus of B11, further comprising:
a switching-on module, configured to turn the second microphone element back on after the translation result is output.
B13. The apparatus of any one of B8 to B10, further comprising:
a first display module, configured to display an interface, the interface comprising at least two oppositely arranged regions, wherein text in different regions corresponds to different languages;
and a second display module, configured to highlight, in the interface, the region matched with the voice signal.
B14. The apparatus of B13, wherein the at least two regions comprise: a first region and a second region, the first region being matched with the voice signal;
wherein the first region displays source-language text, and the second region displays the translation result of that text in the target language.
Embodiment C15 of the invention discloses an apparatus for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
collecting voice signals of a sound source through a microphone array;
determining a target language corresponding to the voice signal according to the spatial information of the sound source relative to the microphone array;
determining a translation result of the voice signal in the target language;
and outputting the translation result.
C16. The apparatus of C15, wherein determining the target language corresponding to the voice signal comprises:
searching a mapping relation between spatial information and target languages according to the spatial information of the sound source relative to the microphone array, so as to obtain the target language corresponding to the voice signal.
C17. The apparatus of C15, wherein outputting the translation result comprises: outputting the translation result after the voice signal has ended;
the one or more programs further comprising instructions for:
performing voice activity detection on the voice signal;
and upon detecting the endpoint of the voice signal, determining that the voice signal has ended.
C18. The apparatus of any one of C15 to C17, wherein the microphone array comprises: a first microphone element and a second microphone element, the first microphone element being matched with the voice signal;
the one or more programs further comprising instructions for:
switching off the second microphone element.
C19. The apparatus of C18, the one or more programs further comprising instructions for:
turning the second microphone element back on after outputting the translation result.
C20. The apparatus of any one of C15 to C17, the one or more programs further comprising instructions for:
displaying an interface, the interface comprising at least two oppositely arranged regions, wherein text in different regions corresponds to different languages;
and highlighting, in the interface, the region matched with the voice signal.
C21. The apparatus of C20, wherein the at least two regions comprise: a first region and a second region, the first region being matched with the voice signal;
wherein the first region displays source-language text, and the second region displays the translation result of that text in the target language.
Embodiment D22 of the invention discloses one or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the speech processing method as described in any one of A1 to A7.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The foregoing has described in detail a speech processing method, speech processing apparatuses, and a machine-readable medium according to the present invention. Specific examples have been used herein to explain the principles and embodiments of the invention, and the above descriptions are intended only to aid understanding of the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the application scope. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of speech processing, the method comprising:
collecting voice signals of a sound source through a microphone array;
determining a target language corresponding to the voice signal according to the spatial information of the sound source relative to the microphone array;
determining a translation result of the voice signal in the target language;
and outputting the translation result.
2. The method of claim 1, wherein determining the target language corresponding to the voice signal comprises:
searching a mapping relation between spatial information and target languages according to the spatial information of the sound source relative to the microphone array, so as to obtain the target language corresponding to the voice signal.
3. The method of claim 1, wherein outputting the translation result comprises: outputting the translation result after the voice signal has ended;
the method further comprises the following steps:
performing voice activity detection on the voice signal;
and upon detecting the endpoint of the voice signal, determining that the voice signal has ended.
4. The method of any one of claims 1 to 3, characterized in that the microphone array comprises: a first microphone element and a second microphone element, the first microphone element being matched with the voice signal;
the method further comprises:
switching off the second microphone element.
5. The method of claim 4, further comprising:
turning the second microphone element back on after outputting the translation result.
6. The method according to any one of claims 1 to 3, further comprising:
displaying an interface, the interface comprising at least two oppositely arranged regions, wherein text in different regions corresponds to different languages;
and highlighting, in the interface, the region matched with the voice signal.
7. The method of claim 6, wherein the at least two regions comprise: a first region and a second region, the first region being matched with the voice signal;
wherein the first region displays source-language text, and the second region displays the translation result of that text in the target language.
8. A speech processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to collect a voice signal of a sound source through a microphone array;
a target language determining module, configured to determine a target language corresponding to the voice signal according to spatial information of the sound source relative to the microphone array;
a translation result determining module, configured to determine a translation result of the voice signal in the target language; and
a translation result output module, configured to output the translation result.
9. An apparatus for speech processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
collecting voice signals of a sound source through a microphone array;
determining a target language corresponding to the voice signal according to the spatial information of the sound source relative to the microphone array;
determining a translation result of the voice signal in the target language;
and outputting the translation result.
10. One or more machine-readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform a speech processing method as recited in one or more of claims 1-7.
CN202010117629.XA 2020-02-25 2020-02-25 Voice processing method, device and medium Pending CN111312212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010117629.XA CN111312212A (en) 2020-02-25 2020-02-25 Voice processing method, device and medium

Publications (1)

Publication Number Publication Date
CN111312212A true CN111312212A (en) 2020-06-19

Family

ID=71162169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010117629.XA Pending CN111312212A (en) 2020-02-25 2020-02-25 Voice processing method, device and medium

Country Status (1)

Country Link
CN (1) CN111312212A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797215A (en) * 2020-06-24 2020-10-20 北京小米松果电子有限公司 Dialogue method, dialogue device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101052069A (en) * 2006-04-07 2007-10-10 英华达(南京)科技有限公司 Translation method for voice conversation
CN101340676A (en) * 2008-08-21 2009-01-07 深圳华为通信技术有限公司 Method, apparatus and mobile terminal implementing simultaneous interpretation
CN102196100A (en) * 2010-03-04 2011-09-21 深圳富泰宏精密工业有限公司 Instant call translation system and method
CN106528545A (en) * 2016-10-19 2017-03-22 腾讯科技(深圳)有限公司 Voice message processing method and device
CN109165389A (en) * 2018-07-23 2019-01-08 北京搜狗科技发展有限公司 A kind of data processing method, device and the device for data processing


Similar Documents

Publication Publication Date Title
CN109447234B (en) Model training method, method for synthesizing speaking expression and related device
US11341957B2 (en) Method for detecting keyword in speech signal, terminal, and storage medium
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN109446876B (en) Sign language information processing method and device, electronic equipment and readable storage medium
CN108363706B (en) Method and device for man-machine dialogue interaction
CN108537207B (en) Lip language identification method, device, storage medium and mobile terminal
CN106971723B (en) Voice processing method and device for voice processing
CN109360549B (en) Data processing method, wearable device and device for data processing
WO2021031308A1 (en) Audio processing method and device, and storage medium
CN107291704B (en) Processing method and device for processing
CN108073572B (en) Information processing method and device, simultaneous interpretation system
CN110572716B (en) Multimedia data playing method, device and storage medium
CN111696553B (en) Voice processing method, device and readable medium
CN108628819B (en) Processing method and device for processing
CN107945806B (en) User identification method and device based on sound characteristics
WO2020103353A1 (en) Multi-beam selection method and device
CN111836062A (en) Video playing method and device and computer readable storage medium
US20210089726A1 (en) Data processing method, device and apparatus for data processing
CN112185388B (en) Speech recognition method, device, equipment and computer readable storage medium
CN113936697B (en) Voice processing method and device for voice processing
CN113343675A (en) Subtitle generating method and device for generating subtitles
CN111312212A (en) Voice processing method, device and medium
CN110767229B (en) Voiceprint-based audio output method, device and equipment and readable storage medium
CN109979435B (en) Data processing method and device for data processing
CN110428828B (en) Voice recognition method and device for voice recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination