CN106528545B - Voice information processing method and device


Info

Publication number
CN106528545B
CN106528545B
Authority
CN
China
Prior art keywords
information
translation
voice information
voice
target
Prior art date
Legal status
Active
Application number
CN201610912091.5A
Other languages
Chinese (zh)
Other versions
CN106528545A
Inventor
薄川川
赵千千
张熙文
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201610912091.5A
Publication of CN106528545A
Application granted
Publication of CN106528545B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation
    • G06F40/56 Natural language generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephone Function (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a voice information processing method and device. The method comprises: acquiring voice information of a sound source and target position information; determining a target translation strategy according to the target position information; translating the voice information with the target translation strategy to obtain translation information; and outputting the translation information. The method allows translation to be performed without the user repeatedly selecting a translation mode, so the operation is simple and the conversation efficiency is high.

Description

Voice information processing method and device
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for processing voice information.
Background
With the rapid development of the world economy, communication between people who speak different languages is increasing. In a two-person conversation, if neither party knows the other's language, an interpreter is often needed to translate the conversation so that the two can communicate. Although translation by an interpreter conveys the conversation content accurately, it is expensive.
To communicate at lower cost, translation is more often realized with translation software: during the conversation, the user's speech is collected through a microphone and analyzed by the software, the analyzed content is translated into the translation language specified by the user, and the translated data is played back as speech, enabling the two parties to communicate. However, this approach has a serious drawback: after each utterance is collected, the user must stop and manually select the required translation language, which makes the conversation cumbersome to operate and inefficient.
Disclosure of Invention
The invention aims to provide a voice information processing method and device that address the technical problems of the existing voice translation approach, namely cumbersome operation and low conversation efficiency.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
a method of processing voice information, comprising:
acquiring voice information and target position information of a sound source;
determining a target translation strategy according to the target position information;
translating the voice information by using the target translation strategy to obtain translation information;
and outputting the translation information.
In order to solve the above technical problems, embodiments of the present invention further provide the following technical solutions:
an apparatus for processing voice information, comprising:
the acquisition module is used for acquiring voice information and target position information of a sound source;
the determining module is used for determining a target translation strategy according to the target position information;
the translation module is used for translating the voice information by using the target translation strategy to obtain translation information;
and the output module is used for outputting the translation information.
With the voice information processing method and device, the voice information of a sound source and the target position information are acquired, a target translation strategy is determined according to the target position information, the voice information is translated with the target translation strategy to obtain translation information, and the translation information is output. Translation can therefore be performed without the user repeatedly entering a translation mode; the operation is simple and the conversation efficiency is high.
Drawings
The technical solution and other advantages of the present invention will become apparent from the following detailed description of specific embodiments of the present invention, which is to be read in connection with the accompanying drawings.
Fig. 1a is a schematic view of a scenario of a system for processing voice information according to an embodiment of the present invention.
Fig. 1b is a flowchart illustrating a method for processing voice information according to an embodiment of the present invention.
Fig. 2a is a flowchart illustrating a method for processing voice information according to an embodiment of the present invention.
Fig. 2b is a schematic diagram of a dual-microphone acquisition process according to an embodiment of the present invention.
Fig. 3a is a schematic structural diagram of a speech information processing apparatus according to an embodiment of the present invention.
Fig. 3b is a schematic structural diagram of another apparatus for processing voice information according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a method, a device and a system for processing voice information.
Referring to fig. 1a, the voice information processing system may include any of the voice information processing apparatuses provided in the embodiments of the present invention. The apparatus may be integrated in a terminal, which may be a mobile phone, a tablet computer, or other equipment with a translation function.
The terminal can acquire the voice information and the target position information of the sound source, determine a target translation strategy according to the target position information, translate the voice information by using the target translation strategy to obtain translation information, and output the translation information.
The sound source may include a person or a sound-producing object, such as a voice playing device during a video call. The target position information may refer to the position of the sound source relative to the terminal and is mainly used for distinguishing different speaking objects. The target translation strategy may be determined according to actual requirements and generally includes an initial language to be translated and a final target language; for example, if the target translation strategy is "translate Chinese into English", the initial language is Chinese and the target language is English. When the two parties of the conversation, P1 and P2, converse while positioned on either side of the terminal, the current speaker can be identified from the position of the speaking object relative to the terminal, so that a suitable target translation strategy is selected for translation; the translated content can then be played through a loudspeaker so that both P1 and P2 can hear it.
The details will be described below separately. The numbers in the following examples are not intended to limit the order of priority of the examples.
First embodiment
The present embodiment will be described from the viewpoint of a processing apparatus of voice information, which can be integrated in a terminal.
Referring to fig. 1b, fig. 1b specifically describes a method for processing voice information according to a first embodiment of the present invention, which may include:
s101, acquiring voice information and target position information of a sound source.
In this embodiment, the sound source may include a person or a sound-emitting object, such as a voice playing device during a video call. The voice information may include the voice content, volume, timbre, and similar attributes. The target position information may refer to the position of the sound source relative to the terminal (or to a component built into the terminal) and is mainly used for distinguishing speaking objects at different positions. The voice information can be acquired through a voice collection device; the target position information can be derived from the collected voice information, or it can be obtained through other detection means, for example by sensing with an infrared device built into the terminal.
For example, the step S101 may specifically include:
1-1, collecting the sound emitted by the sound source with a plurality of audio acquisition units, respectively, to obtain a plurality of pieces of voice information with the same voice content.
In this embodiment, each audio acquisition unit may include a microphone, so that the plurality of audio acquisition units form a microphone array. Each unit has a different installation position in the terminal, and the number of units may be determined according to actual requirements, such as 2 or 3, and so on.
And 1-2, determining target position information according to the voice information and the audio acquisition unit.
For example, the step 1-2 may specifically include:
acquiring the volume value of each voice message and the identification of each audio acquisition unit;
and determining the target position information according to the volume value and the identification.
In this embodiment, the voice information may be digitized (for example, via a Fourier transform) to obtain a volume value. The identifiers are mainly used for distinguishing the different audio acquisition units and can be assigned according to each unit's installation position in the terminal; for example, the identifiers from left to right may be set to M1, M2, ..., Mn. The target position information mainly refers to the position of the sound source relative to the audio acquisition units and may take various forms: the azimuth "left", "center", or "right", or a single identifier M1, M2, ..., or Mn, where each identifier represents a position; to improve accuracy it may also be expressed as an identifier set with a sorting rule, such as M1M2M3 or M1M3M2.
It should be noted that each audio acquisition unit has a different installation position in the terminal, and the closer a unit is to the sound source, the greater the volume it collects. For the same sound source, the content and timbre of the voice information collected by each unit are the same while the volumes differ. Therefore, as long as the volume value collected by each audio acquisition unit is known, the target position information of the sound source can be determined; that is, the position of the sound source relative to the audio acquisition units can be determined from the volume values.
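For illustration only, the sketch below shows one way a volume value could be derived from an audio acquisition unit's digitized samples. The RMS-to-decibel conversion is an assumption made for concreteness; the description leaves the digitization step (for example, a Fourier transform) abstract.

```python
import math

def volume_value_db(samples, ref=1.0):
    """Average volume value, in decibels, of one audio acquisition
    unit's digitized samples (RMS relative to `ref`). A real terminal
    might instead derive this from a Fourier transform of the frame."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12) / ref)
```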
For example, the step of "determining the target location information according to the volume value and the identifier" may specifically include:
2-1, acquiring the identifier of the audio acquisition unit corresponding to the maximum volume value, or sorting the identifiers of the audio acquisition units according to the volume value to acquire a sorted identifier set;
and 2-2, determining the acquired identifier or the sorted identifier set as target position information.
In this embodiment, the identifiers may be sorted by volume value in descending or ascending order to obtain a sorted identifier set. Since the target position information has several possible forms, there are correspondingly several ways to determine it. When the target position information is expressed as an identifier or as an identifier set with a sorting rule, the acquired identifier or the sorted identifier set can serve directly as the target position information. When the target position information is expressed as azimuth information, the corresponding azimuth information must additionally be looked up, according to the acquired identifier or sorted identifier set, in a preset azimuth information base, which stores the association between identifiers (or identifier sets) and azimuth information. This association may be set by the manufacturer when the terminal leaves the factory, for example: M1 or M1M2 corresponds to the azimuth information "left", M2 or M2M1 corresponds to the azimuth information "right", and so on.
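The forms of target position information described above might be computed as in the following sketch; the identifier names and the contents of the preset azimuth information base are illustrative assumptions, not values fixed by the description.

```python
# Preset azimuth information base (assumed factory-set values for a
# two-microphone terminal): sorted identifier set -> azimuth information.
AZIMUTH_INFO_BASE = {"M1M2": "left", "M2M1": "right"}

def target_position_info(volumes):
    """volumes maps each audio acquisition unit's identifier (e.g.,
    'M1') to its volume value. Returns the representations described
    above: the loudest unit's identifier, the sorted identifier set,
    and the azimuth looked up in the preset base (None if absent)."""
    loudest = max(volumes, key=volumes.get)
    id_set = "".join(sorted(volumes, key=volumes.get, reverse=True))
    return loudest, id_set, AZIMUTH_INFO_BASE.get(id_set)

# e.g., target_position_info({"M1": 30, "M2": 34})
# -> ("M2", "M2M1", "right")
```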
And S102, determining a target translation strategy according to the target position information.
In this embodiment, the target translation strategy may be determined according to actual requirements and generally includes an initial language to be translated and a final target language; for example, if the target translation strategy is "translate Chinese into English", the initial language is Chinese and the target language is English.
For example, the step S102 may specifically include:
selecting a corresponding translation strategy from the established translation strategy set according to the target position information;
and determining the selected translation strategy as a target translation strategy.
In this embodiment, the translation strategies in the translation strategy set may be determined according to actual requirements and may include "translate Chinese into English", "translate Japanese into English", "translate English into Chinese", and so on. In practical application, the association between translation strategies and position information must be established in the translation strategy set in advance; the position information used here may be detected by a device built into the terminal, such as a camera, or determined by collecting the user's voice information through a plurality of audio acquisition units.
When the position information in the association relationship is determined by a plurality of audio capture devices capturing voice information of the user, before the step S101, the method for processing the voice information may further include:
collecting the first voice information of a sound source by using the audio collection unit;
acquiring a current translation strategy input by a user;
and establishing a translation strategy set according to the first voice information and the current translation strategy.
In this embodiment, the first voice information may be the voice collected for the first time after the terminal starts the voice translation function. To ensure the accuracy of subsequent position detection, the first voice information may consist of multiple voice segments or may be a single voice segment of a specified duration.
For example, the step of "establishing a translation policy set according to the first voice information and the current translation policy" may specifically include:
acquiring a volume value of first voice information;
determining current position information according to the volume value of the first voice information and the identification of the audio acquisition unit;
and establishing a translation strategy set according to the current position information and the current translation strategy.
In this embodiment, the current position information may be determined in several ways. For example, if the current position information is expressed as an identifier or an identifier set, the identifier of the audio acquisition unit corresponding to the maximum volume value may be acquired, or the identifiers may be sorted by volume value in descending or ascending order to obtain a sorted identifier set; the acquired identifier or the sorted identifier set is then the current position information and is stored in the translation strategy set.
For example, if the current position information is expressed as azimuth information, it may be further determined from the acquired identifier or the sorted identifier set; for instance, the identifier or identifier set is matched against the azimuth information base described in step 2-2, and the matched azimuth information, such as "left" or "right", is the current position information.
In addition, the current location information may also be manually input by the user, for example, the terminal may display a location information selection box to the user, and a plurality of location information such as "left", "center", and "right" may be provided in the selection box for the user to select, or the current location information may be self-detected by the terminal through a built-in device, and so on.
For example, the step of "establishing a translation policy set according to the current location information and the current translation policy" may specifically include:
establishing an incidence relation between the current position information and the current translation strategy;
the association is stored in a set of translation policies.
In this case, the step of "selecting a corresponding translation policy from the established translation policy set according to the target location information" may specifically include:
and selecting a translation strategy corresponding to the target position information from the established translation strategy set according to the incidence relation.
In this embodiment, if both parties (or all parties) to the conversation have recorded their speaking voice, position, and translation strategy when first speaking, then, provided the standing positions do not change, whenever any party speaks the terminal can determine that user's target position information from the collected voice information and look up the corresponding translation strategy to perform the translation. No manual selection by the user is required; the operation is simple and convenient, the probability of the conversation being interrupted is minimized, and communication fluency is improved.
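Under the same dictionary assumption, selecting the target translation strategy reduces to a lookup:

```python
def select_target_strategy(strategy_set, target_position):
    """Return the translation strategy associated with the target
    position information, or None when no association exists (for
    example, because a speaker has changed position)."""
    return strategy_set.get(target_position)
```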
S103, translating the voice information by using the target translation strategy to obtain translation information.
In this embodiment, semantic analysis may be performed on the voice information in the initial language to be translated, and the analyzed semantics may then be expressed in the final target language to obtain the translation information.
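A sketch of this step is shown below; `analyze_semantics` and `express` are hypothetical placeholders standing in for a machine-translation back end, which the description does not specify.

```python
def analyze_semantics(text, initial_lang):
    # Placeholder: a real engine would parse `text` in `initial_lang`.
    return {"lang": initial_lang, "content": text}

def express(semantics, target_lang):
    # Placeholder: a real engine would generate output in `target_lang`.
    return f"[{target_lang}] {semantics['content']}"

def translate_voice_info(voice_text, strategy):
    """S103: semantic analysis in the strategy's initial language,
    then expression of the analyzed semantics in its target language."""
    initial_lang, target_lang = strategy
    return express(analyze_semantics(voice_text, initial_lang), target_lang)
```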
And S104, outputting the translation information.
In this embodiment, the translated content may be played in voice through a device such as a speaker, so that the user can hear the content. It should be noted that the plurality of audio capture units may not perform a voice capture operation during playback.
As can be seen from the above, the voice information processing method provided in this embodiment acquires the voice information of a sound source and the target position information, determines a target translation strategy according to the target position information, translates the voice information with the target translation strategy to obtain translation information, and outputs the translation information, so that translation can be performed without the user repeatedly selecting a translation mode.
Second embodiment
The method described in the first embodiment is further illustrated by way of example.
In this embodiment, the voice information processing device is integrated in the terminal, and the case of a conversation between two participants is described in detail as an example.
As shown in fig. 2a and fig. 2b, a specific flow of a method for processing voice information may be as follows:
s201, the terminal respectively utilizes a plurality of audio acquisition units to acquire the first voice information of a sound source and acquires the current translation strategy input by a user.
For example, the plurality of audio acquisition units may be two microphones, and the sound source may be either of the two conversation parties, P1 or P2. The first voice information may be a piece of voice with a collection duration of 1 minute. Specifically, while the terminal collects the first voice information, P1 or P2 also needs to manually input the required translation strategy; for example, the terminal may present a translation strategy selection box containing options such as "translate Chinese into English", "translate Japanese into English", and "translate English into Chinese".
S202, the terminal obtains the volume value of the first voice information and the identification of each audio acquisition unit.
For example, the identifiers of the two audio acquisition units may be labeled M1 and M2 in order from left to right. Each volume value is a volume average; the values may include L1 and L2, where L1 is 30 decibels and L2 is 34 decibels, M1 corresponding to the volume value L1 and M2 to the volume value L2.
S203, the terminal determines the current position information according to the volume value and the identification of the first voice information.
For example, if the current position information is expressed as an identifier or an identifier set, the identifier M2 of the audio acquisition unit corresponding to the maximum volume value may be acquired, or the identifiers may be sorted by volume value to obtain the sorted identifier set M2M1; M2 or M2M1 is then the current position information.
If the current position information is expressed as azimuth information, such as "left", "center", or "right", it may be further determined from the acquired identifier M2 or the sorted identifier set M2M1; for example, matching M2 or M2M1 against the azimuth information base yields the current position information "right". The azimuth information base stores the association between identifiers (or identifier sets) and azimuth information, which may be set by the manufacturer when the terminal leaves the factory, for example: M1 or M1M2 corresponds to the azimuth information "left", M2 or M2M1 corresponds to the azimuth information "right", and so on.
S204, the terminal establishes the association relationship between the current position information and the current translation strategy and stores the association relationship in the translation strategy set.
For example, the current position information M2, M2M1, or "right" is associated and stored with the current translation strategy "translate Chinese into English", and the current position information M1, M1M2, or "left" is associated and stored with the current translation strategy "translate English into Chinese".
S205, the terminal collects the voice information of the sound source by using the audio collection unit and obtains the volume value of each voice information.
For example, after the terminal establishes the translation strategy set, and provided the parties' positions do not change, as soon as either party starts speaking, the terminal can collect the voice information at that moment with the microphones and determine from it whether the current speaker is P1 or P2, and thereby select the appropriate translation strategy. The strategy can thus be determined without collecting a long segment of voice, which is convenient and fast.
S206, the terminal determines the target position information according to the volume value and the identification of each voice message.
For example, when the target position information is expressed as an identifier or as an identifier set with a sorting rule, the acquired identifier or the sorted identifier set can serve directly as the target position information. When it is expressed as azimuth information, the corresponding azimuth information must be looked up in the azimuth information base according to the acquired identifier or sorted identifier set; for example, the azimuth information found for M1 or M1M2 is "left", from which it can be determined that the current speaker is P1.
And S207, the terminal selects a corresponding translation strategy from the translation strategy set as a target translation strategy according to the target position information.
For example, the terminal may determine the target translation strategy "translate English into Chinese" from the translation strategy set based on the target position information M1, M1M2, or "left", consistent with the associations stored in S204.
And S208, the terminal translates the voice information by using the target translation strategy to obtain translation information and outputs the translation information.
For example, the terminal can translate the English speech spoken by P1 into Chinese speech and play it through the loudspeaker so that P2 can hear it.
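Tying the embodiment together, the following runnable sketch reproduces the flow with the values above; the live-utterance decibel figures and the language-pair encoding are illustrative assumptions.

```python
AZIMUTH_INFO_BASE = {"M1M2": "left", "M2M1": "right"}

def identifier_set(volumes):
    """Sorted identifier set, loudest audio acquisition unit first."""
    return "".join(sorted(volumes, key=volumes.get, reverse=True))

# S201-S204: calibration builds the translation strategy set.
strategy_set = {
    "M2M1": ("zh", "en"),  # right-side party: translate Chinese into English
    "M1M2": ("en", "zh"),  # left-side party: translate English into Chinese
}

# S205-S207: a later utterance is louder at M1, so the speaker (P1)
# is on the left and the English-to-Chinese strategy is selected.
live_volumes = {"M1": 36, "M2": 31}
ids = identifier_set(live_volumes)             # "M1M2"
strategy = strategy_set[ids]                   # ("en", "zh")
print(AZIMUTH_INFO_BASE[ids], "->", strategy)  # left -> ('en', 'zh')
```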
As can be seen from the above, in the voice information processing method provided in this embodiment, the terminal collects the first voice information of a sound source with a plurality of audio acquisition units and acquires the current translation strategy input by the user; it then obtains the volume value of the first voice information and the identifier of each audio acquisition unit, determines the current position information from that volume value and those identifiers, establishes the association between the current position information and the current translation strategy, and stores the association in the translation strategy set. Thereafter, while a sound source speaks, the terminal collects its voice information with the audio acquisition units, obtains the volume value of each piece of voice information, determines the target position information from the volume values and identifiers, selects the corresponding translation strategy from the translation strategy set as the target translation strategy, translates the voice information with it, and outputs the translation information. The user thus only needs to input a translation strategy once for all subsequent translation operations; the operation is simple, interruptions to the conversation are largely avoided, communication is fluent, and the conversation efficiency is high.
Third embodiment
Based on the methods of the first and second embodiments, this embodiment is described from the perspective of a voice information processing apparatus. Referring to fig. 3a, which illustrates the apparatus provided by the third embodiment of the present invention, the apparatus may include: an obtaining module 10, a determining module 20, a translation module 30, and an output module 40, wherein:
(1) acquisition module 10
And the acquisition module 10 is used for acquiring the voice information and the target position information of the sound source.
In this embodiment, the sound source may include a person or a sound-emitting object, such as a voice playing device during a video call. The voice information may include the voice content, volume, timbre, and similar attributes. The target position information may refer to the position of the sound source relative to the terminal (or to a component built into the terminal) and is mainly used for distinguishing speaking objects at different positions. The obtaining module 10 may acquire the voice information through a sound collection device and may obtain the target position information from the collected voice information or from a detection device, for example by sensing with an infrared device built into the terminal.
For example, referring to fig. 3b, the obtaining module 10 may specifically include: a first acquisition submodule 11 and a first determination submodule 12, wherein,
the first collecting submodule 11 is configured to collect sounds emitted by a sound source by using a plurality of audio collecting units, respectively, to obtain a plurality of pieces of speech information having the same speech content.
In this embodiment, the audio capturing unit may include a microphone, and the plurality of audio capturing units may be represented as a microphone array, where each audio capturing unit has a different installation location in the terminal, and the number of the plurality of audio capturing units may be determined according to actual requirements, such as 2 or 3, and so on.
And the first determining submodule 12 is used for determining the target position information according to the voice information and the audio acquisition unit.
For example, the first determining submodule 12 may be specifically configured to:
acquiring the volume value of each voice message and the identification of each audio acquisition unit;
and determining the target position information according to the volume value and the identification.
In this embodiment, the first determining sub-module 12 may digitize the voice information (for example, via a Fourier transform) to obtain a volume value. The identifiers are mainly used for distinguishing the different audio acquisition units and can be assigned according to each unit's installation position in the terminal; for example, the identifiers from left to right may be set to M1, M2, ..., Mn. The target position information mainly refers to the position of the sound source relative to the audio acquisition units and may take various forms: the azimuth "left", "center", or "right", or a single identifier M1, M2, ..., or Mn, where each identifier represents a position; to improve accuracy it may also be expressed as an identifier set with a sorting rule, such as M1M2M3 or M1M3M2.
It should be noted that each audio acquisition unit has a different installation position in the terminal, and the closer a unit is to the sound source, the greater the volume it collects. For the same sound source, the content and timbre of the voice information collected by each unit are the same while the volumes differ. Therefore, as long as the volume value collected by each audio acquisition unit is known, the target position information of the sound source can be determined; that is, the position of the sound source relative to the audio acquisition units can be determined from the volume values.
For example, the first determining submodule 12 may be specifically configured to:
acquiring the identifier of the audio acquisition unit corresponding to the maximum volume value, or sorting the identifiers of the audio acquisition units according to the volume value to acquire a sorted identifier set;
and determining the acquired identifier or the sorted identifier set as target position information.
In this embodiment, the first determining sub-module 12 may sort the identifiers by volume value in descending or ascending order to obtain a sorted identifier set. Since the target position information has several possible forms, there are correspondingly several ways to determine it. When the target position information is expressed as an identifier or as an identifier set with a sorting rule, the first determining sub-module 12 may use the acquired identifier or the sorted identifier set directly as the target position information. When the target position information is expressed as azimuth information, the first determining sub-module 12 must additionally look up the corresponding azimuth information, according to the acquired identifier or sorted identifier set, in a preset azimuth information base, which stores the association between identifiers (or identifier sets) and azimuth information. This association may be set by the manufacturer when the terminal leaves the factory, for example: M1 or M1M2 corresponds to the azimuth information "left", M2 or M2M1 corresponds to the azimuth information "right", and so on.
(2) Determination module 20
And a determining module 20, configured to determine a target translation policy according to the target location information.
In this embodiment, the target translation policy may be determined according to actual requirements and generally includes an initial language to be translated and a final target language; for example, if the target translation policy is "translate Chinese into English", the initial language is Chinese and the target language is English.
For example, the determining module 20 may specifically include: a selection submodule 21 and a second determination submodule 22, wherein:
and the selection submodule 21 is configured to select a corresponding translation policy from the established translation policy set according to the target location information.
And a second determining submodule 22 for determining the selected translation policy as the target translation policy.
In this embodiment, the translation policies in the translation policy set may be determined according to actual requirements and may include "translate Chinese into English", "translate Japanese into English", "translate English into Chinese", and so on. In practical application, the association between translation policies and position information must be established in the translation policy set in advance; the position information used here may be detected by a device built into the terminal, such as a camera, or determined by collecting the user's voice information through a plurality of audio acquisition devices.
When the position information in the association relationship is determined by acquiring the voice information of the user through a plurality of audio acquisition devices, the processing apparatus of the voice information may further include an establishing module 50, and the establishing module 50 may include: a second acquisition submodule 51, an acquisition submodule 52 and a setup submodule 53, wherein:
a second collecting submodule 51, configured to collect, by using the audio collecting unit, first voice information of a sound source before the obtaining module obtains the voice information of the sound source and the target position information;
the obtaining sub-module 52 is used for obtaining the current translation strategy input by the user;
and the establishing sub-module 53 is used for establishing a translation strategy set according to the first voice information and the current translation strategy.
In this embodiment, the first voice information may be voice collected for the first time when the terminal starts the voice translation function, and for ensuring the accuracy of subsequent position information detection, the first voice information may be composed of multiple voice segments, or may be a voice segment with a specified duration.
For example, the establishing sub-module 53 may specifically include:
the acquiring unit is used for acquiring the volume value of the first voice information;
the determining unit is used for determining the current position information according to the volume value of the first voice information and the identification of the audio acquisition unit;
and the first establishing unit is used for establishing a translation strategy set according to the current position information and the current translation strategy.
In this embodiment, the current position information may be determined in several ways. For example, if the current position information is expressed as an identifier or an identifier set, the determining unit may acquire the identifier of the audio acquisition unit corresponding to the maximum volume value, or sort the identifiers by volume value in descending or ascending order to obtain a sorted identifier set; the acquired identifier or the sorted identifier set is then the current position information and is stored in the translation policy set.
For example, if the expression form of the current location information is the orientation information, the determining unit may further determine the current location information of the user according to the obtained identifier or the sorted identifier set, for example, match the obtained identifier or the sorted identifier set with the orientation information base, and the orientation information obtained by matching is the current location information, such as "left" or "right" and the like.
In addition, the current location information may also be manually input by the user, for example, the terminal may display a location information selection box to the user, and a plurality of location information such as "left", "center", and "right" may be provided in the selection box for the user to select, or the current location information may be self-detected by the terminal through a built-in device, and so on.
For example, the first establishing unit may specifically be configured to:
establishing an incidence relation between the current position information and the current translation strategy; the association is stored in a set of translation policies.
In this case, the selection submodule 21 may be specifically configured to:
and selecting a translation strategy corresponding to the target position information from the established translation strategy set according to the incidence relation.
In this embodiment, if both parties (or all parties) to the conversation have recorded their speaking voice, position, and translation policy when first speaking, then, provided the standing positions do not change, whenever any party speaks the first determining sub-module 12 can determine that user's target position information from the collected voice information, and the selection sub-module 21 can look up the corresponding translation policy to perform the translation. No manual selection by the user is required; the operation is simple and convenient, the probability of the conversation being interrupted is minimized, and communication fluency is improved.
(3) Translation module 30
And the translation module 30 is configured to translate the voice information by using the target translation policy to obtain translation information.
In this embodiment, the translation module 30 may perform semantic analysis on the voice information by using the initial language to be translated, and then express the analyzed semantic by using the target language to be translated finally to obtain the translation information.
(4) Output module 40
And an output module 40, configured to output the translation information.
In this embodiment, the output module 40 may perform voice playing on the translated content through a device such as a speaker, so that the user can hear the content. It should be noted that the plurality of audio capture units may not perform a voice capture operation during playback.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, in the apparatus for processing voice information according to this embodiment, the obtaining module 10 obtains the voice information of the sound source and the target position information, the determining module 20 determines the target translation policy according to the target position information, then the translating module 30 translates the voice information by using the target translation policy to obtain the translation information, and the output module 40 outputs the translation information.
Fourth embodiment
Correspondingly, an embodiment of the present invention further provides a voice information processing system, which includes any of the voice information processing apparatuses provided in the embodiments of the present invention; for the apparatus itself, reference may be made to the third embodiment.
The processing device of the voice information may be specifically integrated in the terminal, and for example, may be as follows:
and the terminal is used for acquiring the voice information and the target position information of the sound source, determining a target translation strategy according to the target position information, translating the voice information by using the target translation strategy to obtain translation information and outputting the translation information.
The specific implementation of each device can be referred to the previous embodiment, and is not described herein again.
Since the voice information processing system may include any voice information processing apparatus provided in the embodiments of the present invention, it can achieve the beneficial effects achievable by any of those apparatuses; for details, see the foregoing embodiments, which are not repeated here.
Fifth embodiment
Accordingly, an embodiment of the present invention further provides a terminal, as shown in fig. 4, the terminal may include a Radio Frequency (RF) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a Wireless Fidelity (WiFi) module 607, a processor 608 including one or more processing cores, and a power supply 609. Those skilled in the art will appreciate that the terminal configuration shown in fig. 4 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the RF circuit 601 may be used for receiving and transmitting signals during a message transmission or communication process, and in particular, for receiving downlink messages from a base station and then processing the received downlink messages by one or more processors 608; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuit 601 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 601 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 602 may be used to store software programs and modules, and the processor 608 executes various functional applications and data processing by operating the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the terminal, etc. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 access to the memory 602.
The input unit 603 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, input unit 603 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 608, and can receive and execute commands sent by the processor 608. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 603 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 604 may be used to display information input by or provided to the user and various graphical user interfaces of the terminal, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 604 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 608 to determine the type of touch event, and the processor 608 then provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 4 the touch-sensitive surface and the display panel are shown as two separate components to implement input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement input and output functions.
The terminal may also include at least one sensor 605, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or the backlight when the terminal is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when the mobile phone is stationary, and can be used for applications of recognizing the posture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured in the terminal, detailed description is omitted here.
Audio circuitry 606, a speaker, and a microphone may provide an audio interface between the user and the terminal. The audio circuit 606 may transmit the electrical signal converted from the received audio data to a speaker, and convert the electrical signal into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electric signal, which is received by the audio circuit 606 and converted into audio data, which is then processed by the audio data output processor 608, and then transmitted to, for example, another terminal via the RF circuit 601, or the audio data is output to the memory 602 for further processing. The audio circuit 606 may also include an earbud jack to provide communication of peripheral headphones with the terminal.
WiFi belongs to short-distance wireless transmission technology, and the terminal can help a user to receive and send e-mails, browse webpages, access streaming media and the like through the WiFi module 607, and provides wireless broadband internet access for the user. Although fig. 4 shows the WiFi module 607, it is understood that it does not belong to the essential constitution of the terminal, and may be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 608 is a control center of the terminal, connects various parts of the entire handset using various interfaces and lines, and performs various functions of the terminal and processes data by operating or executing software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby performing overall monitoring of the handset. Optionally, processor 608 may include one or more processing cores; preferably, the processor 608 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 608.
The terminal also includes a power supply 609 (e.g., a battery) for powering the various components, which may preferably be logically connected to the processor 608 via a power management system that may be used to manage charging, discharging, and power consumption. The power supply 609 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown, the terminal may further include a camera, a bluetooth module, and the like, which will not be described herein. Specifically, in this embodiment, the processor 608 in the terminal loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 608 runs the application programs stored in the memory 602, thereby implementing various functions:
acquiring voice information and target position information of a sound source;
determining a target translation strategy according to the target position information;
translating the voice information by using the target translation strategy to obtain translation information;
the translation information is output.
The implementation method of the above operations may specifically refer to the above embodiments, and details are not described herein.
The terminal can achieve the beneficial effects achievable by any of the voice information processing apparatuses provided in the embodiments of the present invention; for details, see the foregoing embodiments, which are not repeated here.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The method, apparatus, and system for processing voice information provided by the embodiments of the present invention are described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (18)

1. A method for processing voice information, comprising:
acquiring voice information and target position information of a sound source, wherein the target position information is the relative position between the sound source and a terminal, and the relative position is obtained by the terminal detecting the position of the sound source;
determining a target translation strategy according to the target position information;
translating the voice information by using the target translation strategy to obtain translation information;
and outputting the translation information.
2. The method for processing voice information according to claim 1, wherein the relative position is obtained by the terminal detecting the position of the sound source through a sound collection device or an infrared device.
3. The method for processing voice information according to claim 2, wherein, when the relative position is obtained by the terminal detecting the position of the sound source through a sound collection device, the sound collection device includes a plurality of audio collection units, and the acquiring voice information and target position information of a sound source comprises:
collecting the sound emitted by the sound source with each of the plurality of audio collection units, respectively, to obtain a plurality of pieces of voice information having the same voice content;
and determining the target position information according to the voice information and the audio collection units.
4. The method for processing voice information according to claim 3, wherein the determining the target position information according to the voice information and the audio collection units comprises:
acquiring the volume value of each piece of voice information and the identification of each audio collection unit;
and determining the target position information according to the volume values and the identifications.
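Read together, claims 3 and 4 suggest one simple realization: each audio collection unit reports a volume value for the same utterance, and the identification of the loudest unit indicates which side of the terminal the sound source is on. The sketch below is one assumed reading, with invented identifications and volume values, not the patented implementation.

    # Hypothetical sketch of claims 3-4: the loudest audio collection unit's
    # identification is taken as the relative position of the sound source.
    samples = [("mic_top", 62.5), ("mic_bottom", 48.1)]  # (identification, volume value)

    def determine_target_position(samples):
        # The unit nearest the speaker records the highest volume value.
        loudest_id, _ = max(samples, key=lambda s: s[1])
        return loudest_id  # the identification stands in for the position

    print(determine_target_position(samples))  # -> mic_top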
5. The method for processing voice information according to claim 1, wherein the determining a target translation strategy according to the target position information comprises:
selecting a corresponding translation strategy from the established translation strategy set according to the target position information;
and determining the selected translation strategy as a target translation strategy.
6. The method for processing voice information according to claim 5, further comprising, before the acquiring voice information and target position information of a sound source:
collecting first voice information of a sound source by using an audio collection unit;
acquiring a current translation strategy input by a user;
and establishing a translation strategy set according to the first voice information and the current translation strategy.
7. The method for processing voice information according to claim 6, wherein the establishing a translation strategy set according to the first voice information and the current translation strategy comprises:
acquiring the volume value of the first voice information;
determining current position information according to the volume value of the first voice information and the identification of the audio collection unit;
and establishing the translation strategy set according to the current position information and the current translation strategy.
8. The method for processing voice information according to claim 7, wherein:
the establishing the translation strategy set according to the current position information and the current translation strategy comprises: establishing an association relationship between the current position information and the current translation strategy, and storing the association relationship in the translation strategy set;
and the selecting a corresponding translation strategy from the established translation strategy set according to the target position information comprises: selecting the translation strategy corresponding to the target position information from the established translation strategy set according to the association relationship.
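Claims 6 to 8 describe building the translation strategy set once, up front: the terminal detects where a first utterance came from, asks the user which translation strategy applies to that position, and stores the association, so that later lookups need no user input. A dictionary keyed by position is one natural realization; the sketch below is an assumption for illustration, not the patented implementation.

    # Hypothetical sketch of claims 6-8: the strategy set stores an association
    # relationship between a position and a user-chosen translation strategy.
    strategy_set = {}  # position -> translation strategy

    def establish(current_position: str, current_strategy: str) -> None:
        strategy_set[current_position] = current_strategy  # store the association

    def select(target_position: str) -> str:
        return strategy_set[target_position]  # select by the stored association

    establish("mic_top", "zh->en")     # the user near the top microphone speaks Chinese
    establish("mic_bottom", "en->zh")  # the other party speaks English
    print(select("mic_top"))           # -> zh->en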
9. An apparatus for processing voice information, comprising:
the acquisition module is used for acquiring voice information and target position information of a sound source, wherein the target position information is the relative position between the sound source and a terminal, and the relative position is obtained by the terminal detecting the position of the sound source;
the determining module is used for determining a target translation strategy according to the target position information;
the translation module is used for translating the voice information by using the target translation strategy to obtain translation information;
and the output module is used for outputting the translation information.
10. The apparatus for processing voice information according to claim 9, wherein the relative position is obtained by the terminal detecting the position of the sound source through a sound collection device or an infrared device.
11. The apparatus for processing voice information according to claim 10, wherein, when the relative position is obtained by the terminal detecting the position of the sound source through a sound collection device, the sound collection device includes a plurality of audio collection units, and the acquisition module specifically includes:
the first collection submodule is used for collecting the sound emitted by the sound source with each of the plurality of audio collection units, respectively, to obtain a plurality of pieces of voice information having the same voice content;
and the first determining submodule is used for determining the target position information according to the voice information and the audio collection units.
12. The apparatus for processing voice information according to claim 11, wherein the first determining submodule is specifically used for:
acquiring the volume value of each piece of voice information and the identification of each audio collection unit;
and determining the target position information according to the volume values and the identifications.
13. The apparatus for processing voice information according to claim 11, wherein the determining module specifically includes:
the selection submodule is used for selecting a corresponding translation strategy from the established translation strategy set according to the target position information;
and the second determining submodule is used for determining the selected translation strategy as a target translation strategy.
14. The apparatus for processing voice information according to claim 13, further comprising an establishing module, wherein the establishing module comprises:
the second collection submodule is used for collecting the first voice information of the sound source by using the audio collection unit before the acquisition module acquires the voice information and the target position information of the sound source;
the obtaining submodule is used for obtaining a current translation strategy input by a user;
and the establishing submodule is used for establishing a translation strategy set according to the first voice information and the current translation strategy.
15. The apparatus for processing voice information according to claim 14, wherein the establishing submodule specifically includes:
the acquiring unit is used for acquiring the volume value of the first voice information;
the determining unit is used for determining the current position information according to the volume value of the first voice information and the identification of the audio collection unit;
and the first establishing unit is used for establishing a translation strategy set according to the current position information and the current translation strategy.
16. The apparatus for processing voice information according to claim 15, wherein:
the first establishing unit is used for: establishing an association relationship between the current position information and the current translation strategy, and storing the association relationship in the translation strategy set;
and the selection submodule is used for: selecting the translation strategy corresponding to the target position information from the established translation strategy set according to the association relationship.
17. A computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to execute the method for processing voice information according to any one of claims 1 to 8.
18. A terminal, comprising a processor and a memory, the processor being electrically connected to the memory, the memory being used for storing instructions and data, and the processor being used for performing the steps of the method for processing voice information according to any one of claims 1 to 8.
CN201610912091.5A 2016-10-19 2016-10-19 Voice information processing method and device Active CN106528545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610912091.5A CN106528545B (en) 2016-10-19 2016-10-19 Voice information processing method and device

Publications (2)

Publication Number Publication Date
CN106528545A CN106528545A (en) 2017-03-22
CN106528545B true CN106528545B (en) 2020-03-17

Family

ID=58332748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610912091.5A Active CN106528545B (en) 2016-10-19 2016-10-19 Voice information processing method and device

Country Status (1)

Country Link
CN (1) CN106528545B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506353B (en) * 2017-10-17 2021-01-26 深圳市沃特沃德股份有限公司 Translation box and translation system
WO2019104667A1 (en) * 2017-11-30 2019-06-06 深圳市沃特沃德股份有限公司 Method for operating translating machine and finger-ring remote controller
CN108509430A (en) * 2018-04-10 2018-09-07 京东方科技集团股份有限公司 Intelligent glasses and its interpretation method
CN109286875B (en) * 2018-09-29 2021-01-01 百度在线网络技术(北京)有限公司 Method, apparatus, electronic device and storage medium for directional sound pickup
CN110188363A (en) 2019-04-26 2019-08-30 北京搜狗科技发展有限公司 A kind of information switching method, device and interpreting equipment
CN110457716B (en) * 2019-07-22 2023-06-06 维沃移动通信有限公司 Voice output method and mobile terminal
CN111312212A (en) * 2020-02-25 2020-06-19 北京搜狗科技发展有限公司 Voice processing method, device and medium
CN111428521B (en) * 2020-03-23 2022-03-15 合肥联宝信息技术有限公司 Data processing method and electronic equipment
CN111795707A (en) * 2020-07-21 2020-10-20 高超群 New energy automobile charging pile route planning method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957814A (en) * 2009-07-16 2011-01-26 刘越 Instant speech translation system and method
CN102811284A (en) * 2012-06-26 2012-12-05 深圳市金立通信设备有限公司 Method for automatically translating voice input into target language
CN103119642A (en) * 2010-09-28 2013-05-22 雅马哈株式会社 Audio output device and audio output method
CN103559180A (en) * 2013-10-12 2014-02-05 安波 Chat translator
CN104394265A (en) * 2014-10-31 2015-03-04 小米科技有限责任公司 Automatic session method and device based on mobile intelligent terminal
CN105117391A (en) * 2010-08-05 2015-12-02 谷歌公司 Translating languages
CN105794231A (en) * 2013-11-22 2016-07-20 苹果公司 Handsfree beam pattern configuration

Similar Documents

Publication Publication Date Title
CN106528545B (en) Voice information processing method and device
US11452172B2 (en) Method for switching master earphone, mobile terminal, and non-transitory computer-readable storage medium
EP3614383A1 (en) Audio data processing method and apparatus, and storage medium
WO2017020663A1 (en) Live-comment video live broadcast method and apparatus, video source device, and network access device
CN106973330B (en) Screen live broadcasting method, device and system
CN104852885B (en) Method, device and system for verifying verification code
CN106203235B (en) Living body identification method and apparatus
US10657347B2 (en) Method for capturing fingerprint and associated products
CN105208056B (en) Information interaction method and terminal
CN106371964B (en) Method and device for prompting message
CN109243488B (en) Audio detection method, device and storage medium
CN106940997B (en) Method and device for sending voice signal to voice recognition system
CN109817241B (en) Audio processing method, device and storage medium
CN108418969B (en) Antenna feed point switching method and device, storage medium and electronic equipment
WO2017215661A1 (en) Scenario-based sound effect control method and electronic device
WO2017215635A1 (en) Sound effect processing method and mobile terminal
CN109639863B (en) Voice processing method and device
CN106506437B (en) Audio data processing method and device
CN112230877A (en) Voice operation method and device, storage medium and electronic equipment
WO2019076250A1 (en) Push message management method and related products
WO2017215615A1 (en) Sound effect processing method and mobile terminal
CN108492837B (en) Method, device and storage medium for detecting audio burst white noise
CN106303616B (en) Play control method, device and terminal
CN106228994B (en) A kind of method and apparatus detecting sound quality
CN111897916B (en) Voice instruction recognition method, device, terminal equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant