CN116705027A - Voice information processing method and device, electronic equipment and readable storage medium - Google Patents

Voice information processing method and device, electronic equipment and readable storage medium

Info

Publication number
CN116705027A
Authority
CN
China
Prior art keywords
voice information
sound source
sub
voice
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210183336.0A
Other languages
Chinese (zh)
Inventor
唐涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pateo Connect Nanjing Co Ltd
Original Assignee
Pateo Connect Nanjing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pateo Connect Nanjing Co Ltd filed Critical Pateo Connect Nanjing Co Ltd
Priority to CN202210183336.0A
Publication of CN116705027A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22 - Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L2013/021 - Overlap-add techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present application provides a voice information processing method and apparatus, an electronic device, and a readable storage medium, relating to the technical field of voice information processing. The voice information processing method includes the following steps: acquiring a plurality of pieces of sub-voice information, where the pieces of sub-voice information are acquired for the same voice by a plurality of different microphones arranged at intervals, each microphone corresponds to a sound receiving area, and the areas between the sound receiving areas are critical areas; determining a sound source area of the voice according to the plurality of pieces of sub-voice information, where the sound source area is a sound receiving area or a critical area; obtaining target voice information according to the sound source area and the plurality of pieces of sub-voice information; and obtaining text information according to the target voice information.

Description

Voice information processing method and device, electronic equipment and readable storage medium
Technical Field
The embodiments of the present application relate to the technical field of voice information processing, and in particular to a voice information processing method and apparatus, an electronic device, and a readable storage medium.
Background
Some vehicles are equipped with in-vehicle terminals that support voice recognition: the terminal recognizes acquired voice information as text information and, based on that text, controls the vehicle or feeds specific content back to the user. The accuracy with which voice information is recognized as text is therefore a key factor affecting the user experience.
In the prior art, a plurality of microphones are arranged at intervals in the vehicle cabin, and each microphone acquires voice information when a user speaks. The in-vehicle terminal synthesizes the individual pieces of voice information into one piece of voice information, separates the voice information to be recognized from the synthesized voice information using blind source separation, and then recognizes the separated voice information as text.
In actual use, however, the recognized text information is often inaccurate.
Disclosure of Invention
An embodiment of the present application provides a voice information processing method with the following advantage: a plurality of pieces of sub-voice information are acquired, the sound source area of the voice is determined from them, target voice information is obtained according to the sound source area and the sub-voice information, and text information is obtained according to the target voice information. That is, the target voice information to be converted into text is generated according to the sound source area, different target voice information is generated for different sound source areas, and the influence of the sound source position on the voice information processing result is fully considered, which solves the prior-art problem that voice information separated by blind source separation is inaccurate when the phase differences among the sub-voice information are small.
An embodiment of the present application provides a voice information processing method with the following advantage: when the sound source is located in a critical area, the phase differences among the sub-voice information received by the microphones are small, and if blind source separation is applied to the whole voice information synthesized from the sub-voice information, the separated voice tends to be inaccurate. Therefore, when the sound source area is a critical area, a whole-cabin sound pickup mode can be adopted, synthesizing the pieces of sub-voice information into the target voice information, that is, taking the whole synthesized voice information as the target voice information; this preserves the integrity of the voice information and prevents problems such as dropped words in the separated voice.
An embodiment of the present application provides a voice information processing method with the following advantage: when the sound source is located in a sound receiving area, the phase differences among the sub-voice information received by the microphones are large, satisfying the conditions for blind source separation; the whole voice information synthesized from the sub-voice information is separated by blind source separation, and the accuracy of the separated voice information is high.
An embodiment of the present application provides a voice information processing method with the following advantage: the sound source position of the voice is determined from the plurality of pieces of sub-voice information, and the sound source area is then determined from the sound source position, so that the sound source area can be determined once the sub-voice information has been acquired, providing a basis for selecting the subsequent voice information processing method.
An embodiment of the present application provides a voice information processing method with the following advantage: the voice uttered by a passenger travels to the different microphones at the same propagation speed, but the distances from the passenger to the microphones differ, so the voice arrives at the microphones at different times; the distance differences between the passenger and the microphones can therefore be determined from the arrival-time differences and the speed of sound, which in turn determines the passenger's position, that is, the sound source position.
An embodiment of the present application provides a voice information processing method with the following advantage: after a passenger utters a voice, the sound propagates through the air and its intensity attenuates gradually as the propagation distance increases, and the relationship between intensity attenuation and propagation distance can be obtained experimentally. Once a microphone has measured the intensity of the voice, the distance between the microphone and the passenger can therefore be derived, and the passenger's position determined. Alternatively, the distance differences between the passenger and the different microphones can be determined from the intensity differences among the pieces of sub-voice information, again yielding the passenger's position, that is, the sound source position.
An embodiment of the present application provides a voice information processing method with the following advantage: the sound source position may be judged inaccurately, or several candidate positions may be determined; since a passenger's speech is usually accompanied by lip movement, judging the sound source region from lip movement information improves the accuracy of the judgment.
In a first aspect, the present application provides a voice information processing method, including: acquiring a plurality of pieces of sub-voice information, where the pieces of sub-voice information are acquired for the same voice by a plurality of different microphones arranged at intervals, each microphone corresponds to a sound receiving area, and the areas between the sound receiving areas are critical areas; determining a sound source area of the voice according to the plurality of pieces of sub-voice information, where the sound source area is a sound receiving area or a critical area; obtaining target voice information according to the sound source area and the plurality of pieces of sub-voice information; and obtaining text information according to the target voice information.
In a second aspect, the present application provides a voice information processing apparatus comprising:
a plurality of microphones arranged at intervals, each microphone corresponding to a sound receiving area, with the areas between the sound receiving areas being critical areas, the microphones being configured to acquire a plurality of pieces of sub-voice information, each piece acquired by a different one of the microphones for the same voice;
a processor electrically connected to the plurality of microphones and configured to: determine a sound source area of the voice according to the plurality of pieces of sub-voice information, the sound source area being a sound receiving area or a critical area; obtain target voice information according to the sound source area and the plurality of pieces of sub-voice information; and obtain text information according to the target voice information.
In a third aspect, the present application provides an electronic device comprising a processor, a memory, and a program or instructions stored in the memory and executable on the processor, the program or instructions implementing the above method when executed by the processor.
In a fourth aspect, the present application provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the above method.
Drawings
FIG. 1 is a schematic view of the interior space distribution of a passenger vehicle;
FIG. 2 is a schematic illustration of a sound source located in a critical region;
FIG. 3 is a schematic illustration of a sound source located in a sound receiving area;
FIG. 4 is a flowchart of an embodiment of a method for processing voice information according to the present application;
FIG. 5 is a block diagram of an electronic device according to an embodiment of the present application;
fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
Detailed Description
Vehicles with a voice recognition function are generally equipped with a microphone array for receiving voice. The array comprises a plurality of microphones, each corresponding to a sound receiving area in which that microphone picks up voice well. Taking a passenger car as an example, fig. 1 is a schematic diagram of the car's interior space: four seats are arranged in the interior, one microphone is provided for each seat, each microphone corresponds to a sound receiving area (the solid oval areas in the figure), and the dashed oval areas between the sound receiving areas are critical areas. It will be appreciated that the arrangement of microphones, sound receiving areas, and critical areas is not limited to this example; it can be adapted flexibly to the actual vehicle space, and the sound receiving areas may be independent of one another or may partially or fully overlap.
When a passenger in the vehicle utters a voice, each microphone in the array acquires one piece of sub-voice information; the in-vehicle terminal synthesizes these pieces into one piece of voice information and then uses blind source separation to extract the voice information to be recognized. For example, suppose a driver and a child are in the car, and the driver issues a voice command to control the vehicle while the child sings a nursery rhyme. Each microphone receives both the driver's command voice and the child's singing; the singing is noise that interferes with recognizing the command, so blind source separation must be used to separate the command voice from the synthesized voice information.
Blind source separation relies mainly on the phase differences among the sub-voice information acquired by different microphones. When the phase differences among the received pieces of sub-voice information are large, the separated voice information is generally accurate. For example, when the passenger uttering the voice is at the position of the speaker icon in fig. 2, the distances from the passenger to the two rear microphones differ considerably, so the phase difference between the pieces of sub-voice information they receive is also large. When the phase differences are small or absent, the separated voice information is prone to inaccuracies such as dropped words. For example, as shown in fig. 3, when the passenger uttering the voice is at the position of the speaker icon, the distances from the passenger to the two rear microphones are equal, so the sub-voice information received by those microphones has little or no phase difference.
To solve the above problems, the present application provides a voice information processing method and apparatus, an electronic device, and a readable storage medium, which determine the position of the sound source after receiving voice information and apply different processing methods for different sound source positions, thereby addressing the prior-art problem of inaccurate voice recognition caused by insufficient consideration of the sound source position.
The following describes in detail a voice information processing method, a device, an electronic apparatus, and a readable storage medium provided by the present application with reference to the accompanying drawings.
Fig. 4 is a flowchart of an embodiment of the voice information processing method provided by the present application. The method can be applied to a vehicle or to voice control of a smart home; for convenience, the application is described using the vehicle case as an example. A plurality of seating areas are arranged in the vehicle, one microphone is provided for each seating area, the sound receiving area of each microphone is its corresponding seating area, and the critical areas are the areas between the seating areas. As shown in fig. 4, the voice information processing method includes the following steps:
step 110, a plurality of sub-voice information is acquired.
Vehicles with a speech recognition function are typically equipped with a microphone array for receiving speech; the array comprises a plurality of microphones arranged at intervals. Each microphone has a sound receiving area corresponding to its installation position, and when the sound source is located in that area, the corresponding microphone picks up the sound best. The areas between the sound receiving areas are critical areas; a critical area lies at roughly equal distances from the microphones of the two adjacent sound receiving areas.
When a passenger in the vehicle utters a voice, each of the microphones in the array receives a piece of sub-voice information corresponding to that voice. That is, the pieces of sub-voice information all correspond to the same voice; they differ because the microphones are at different distances from the sound source, so the pieces may differ in intensity, phase, reception time, and so on.
For example, suppose the vehicle has a driver's area and a front-passenger area, with a first microphone in the driver's area and a second microphone in the front-passenger area. The driver's area is the first sound receiving area, corresponding to the first microphone; the front-passenger area is the second sound receiving area, corresponding to the second microphone; and the area between them is a critical area (for example, the armrest box and center console area, extended upward to the roof). When the driver speaks in the driver's area, the first microphone receives first sub-voice information corresponding to the voice, and the second microphone receives second sub-voice information corresponding to the voice.
In practical applications, the microphone array can be connected to a processor, and each microphone sends its sub-voice information to the processor after acquiring it. The processor can apply noise reduction, echo cancellation, and similar processing to the sub-voice information to improve its quality.
Step 120, determining a sound source region of the voice according to the plurality of sub-voice information.
The sound source region is the region in which the sound source that uttered the voice is located, for example the region occupied by the passenger who spoke; it is either a sound receiving region or a critical region.
Blind source separation requires large phase differences among the pieces of sub-voice information, but when the sound source is located in a critical area the phase differences are small, which easily makes the separated voice inaccurate. The sound source area therefore needs to be judged before the voice is separated.
In some embodiments, step 120 includes:
Sub-step 121: determining a sound source position of the voice from the plurality of pieces of sub-voice information.
Sub-step 122: determining the sound source region from the sound source position.
In practical applications, the positions included in each sound receiving area and in each critical area of the scene can be determined and stored in advance. After the sound source position of the voice has been determined by a sound source localization method, whether the sound source area is a critical area or a sound receiving area can be determined by looking the position up in the pre-stored areas.
The sound source position of the voice is determined from the plurality of pieces of sub-voice information, and the sound source area is then determined from the sound source position, so that the sound source area can be determined once the sub-voice information has been acquired, providing a basis for selecting the subsequent voice information processing method.
In some embodiments, the sub-voice information includes time information characterizing the time at which the microphone acquired it, and determining the sound source position in sub-step 121 includes:
Determining the sound source position of the voice according to the time differences among the pieces of time information.
The voice uttered by a passenger travels to the different microphones at the same propagation speed, but the distances from the passenger to the microphones differ, so the voice arrives at the microphones at different times. The distance differences between the passenger and the microphones can therefore be determined from the arrival-time differences and the speed of sound, which in turn determines the passenger's position, that is, the sound source position.
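As a concrete illustration of this time-difference reasoning, the following Python sketch (helper names and the 2-D seat geometry are illustrative assumptions, not from the application) converts an arrival-time difference into a distance difference via the nominal speed of sound and matches it against known candidate seat positions:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def distance_difference(dt_seconds):
    """Distance difference between the source and two microphones,
    derived from the arrival-time difference of the same voice."""
    return SPEED_OF_SOUND * dt_seconds

def locate_among_candidates(mics, dt, candidates, tol=0.05):
    """Pick the candidate seat position whose microphone-distance
    difference best matches the measured time difference of arrival.
    mics: ((x1, y1), (x2, y2)) positions; dt: t1 - t2 in seconds."""
    target = distance_difference(dt)
    best, best_err = None, float("inf")
    for pos in candidates:
        d1 = math.dist(pos, mics[0])
        d2 = math.dist(pos, mics[1])
        err = abs((d1 - d2) - target)
        if err < best_err:
            best, best_err = pos, err
    return best if best_err <= tol else None
```

With more than two microphones, each pair contributes one such distance-difference constraint, and the position consistent with all pairs is chosen.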
In some embodiments, the sub-voice information includes sound intensity information characterizing its sound intensity, and determining the sound source position in sub-step 121 includes:
Determining the sound source position of the voice according to the differences among the pieces of sound intensity information.
After a passenger utters a voice, the sound propagates through the air and its intensity attenuates gradually as the propagation distance increases; the relationship between intensity attenuation and propagation distance can be obtained experimentally. Once a microphone has measured the intensity of the voice, the distance between the microphone and the passenger can therefore be derived, and the passenger's position determined. Alternatively, the distance differences between the passenger and the different microphones can be determined from the intensity differences among the pieces of sub-voice information, again yielding the passenger's position, that is, the sound source position.
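A minimal sketch of the intensity-based idea, assuming an idealized free-field inverse-square attenuation model; a real cabin would use the experimentally measured attenuation curve mentioned above, so treat these formulas as placeholders:

```python
import math

def distance_ratio_from_intensities(i1, i2):
    """Under a free-field inverse-square model (an idealization; a real
    cabin needs a measured attenuation curve), intensity falls off as
    1/r^2, so the distance ratio is r2 / r1 = sqrt(i1 / i2)."""
    return math.sqrt(i1 / i2)

def nearest_microphone(intensities):
    """Under this model, the microphone that measured the highest
    intensity is the one closest to the speaker."""
    return max(range(len(intensities)), key=lambda k: intensities[k])
```

The per-seat calibration of the attenuation curve replaces `distance_ratio_from_intensities` in practice; the comparison logic stays the same.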
In some embodiments, the following is further included after sub-step 121 and before sub-step 122:
Obtaining a confidence for the sound source position.
If the confidence is less than a preset threshold, acquiring video information.
Identifying lip movement information in the video information.
The confidence characterizes the accuracy of the sound source position. During localization, the determined position may be inaccurate, or several candidate positions may be found; in either case the confidence falls below the preset threshold, and other means are needed to judge the sound source position and raise the confidence. In practical applications, a video acquisition device (for example, an infrared camera) can be installed in the vehicle to acquire video of the cabin; lip movement is then identified in the video, and the sound source region is determined from the lip movement information together with the sound source position. For example, if two candidate sound source positions are found, one at the driver's seat and one at the front passenger's seat, and lip movement is detected only in the region of the driver's seat in the video, the sound source position is determined to be the driver's seat.
The judgment of the sound source position may be inaccurate, or several candidate positions may be determined; since a passenger's speech is usually accompanied by lip movement, judging the sound source region from lip movement information improves the accuracy of the judgment.
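The confidence-gated fallback to lip movement described above might be sketched as follows (function and region names are illustrative, not from the application):

```python
def resolve_sound_source(candidates, confidence, lip_active_regions,
                         threshold=0.8):
    """If acoustic localization is confident and unambiguous, trust it;
    otherwise fall back to video, keeping only the candidate positions
    whose cabin region shows lip movement. Returns None when the
    ambiguity cannot be resolved. Region names are placeholders."""
    if confidence >= threshold and len(candidates) == 1:
        return candidates[0]
    confirmed = [pos for pos in candidates if pos in lip_active_regions]
    return confirmed[0] if len(confirmed) == 1 else None
```

The same gate accommodates other auxiliary signals, such as the seat pressure sensors described below, by intersecting the candidates with any region known to be occupied.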
Optionally, pressure sensors can also be provided in the seats, and the sound source area determined from the sensor readings together with the sound source position.
For example, if two candidate sound source positions are determined, one at the driver's seat and one at the front passenger's seat, and the pressure sensor in the driver's seat detects an occupant while the one in the front passenger's seat does not, the sound source position can be determined to be the driver's seat.
It will be appreciated that the sound source region of the voice can also be determined directly from the sub-voice information. For example: when the phase difference among the pieces of sub-voice information is greater than or equal to a first preset threshold, the sound source area is judged to be a sound receiving area, and when it is smaller than the first preset threshold, a critical area; when the intensity difference is greater than or equal to a second preset threshold, a sound receiving area, and when it is smaller, a critical area; or when the difference in reception times is greater than or equal to a third preset threshold, a sound receiving area, and when it is smaller, a critical area.
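The direct threshold comparisons above could be sketched as follows (the threshold values are placeholders, not values from the application):

```python
def classify_region(phase_diff=None, intensity_diff=None, time_diff=None,
                    phase_th=0.5, intensity_th=3.0, time_th=0.001):
    """Classify the sound source region directly from inter-microphone
    differences: a large difference means the source sits clearly inside
    one microphone's sound receiving area, a small one means a critical
    area. The first measurement supplied decides; thresholds are
    illustrative placeholders to be tuned per vehicle."""
    for value, th in ((phase_diff, phase_th),
                      (intensity_diff, intensity_th),
                      (time_diff, time_th)):
        if value is not None:
            return "sound_receiving" if abs(value) >= th else "critical"
    raise ValueError("at least one difference measurement is required")
```

This shortcut skips explicit localization entirely, trading positional detail for a single binary decision that is all steps 130 and 140 actually need.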
Step 130: obtaining target voice information according to the sound source area and the plurality of pieces of sub-voice information.
The target voice information is the voice information that will subsequently be converted to text; it is obtained by processing the pieces of sub-voice information with a target voice generation method. A different generation method is selected for each kind of sound source area, yielding different target voice information: when the sound source area is a sound receiving area, the generation method corresponding to sound receiving areas is selected; when it is a critical area, the method corresponding to critical areas is selected.
In practical applications, the pieces of sub-voice information can first be synthesized into one piece of whole voice information, after which the target voice generation method is selected according to the sound source area.
Step 140: obtaining text information according to the target voice information.
Obtaining text information from the target voice information can follow the related art, for example Automatic Speech Recognition (ASR), and is not described further here.
According to the voice processing method provided by this embodiment of the application, a plurality of pieces of sub-voice information are acquired, the sound source area of the voice is determined from them, target voice information is obtained according to the sound source area and the sub-voice information, and text information is obtained from the target voice information. That is, the target voice information to be converted into text is generated according to the sound source area, different target voice information is generated for different sound source areas, and the influence of the sound source position on the processing result is fully considered, which solves the prior-art problem that voice information separated by blind source separation is inaccurate when the phase differences among the sub-voice information are small.
In some embodiments, obtaining target voice information according to the sound source region and the plurality of pieces of sub-voice information in step 130 includes:
Sub-step 201: if the sound source region is a critical region, synthesizing the pieces of sub-voice information into the target voice information.
When the sound source is located in a critical area, the phase differences among the sub-voice information received by the microphones are small, and if blind source separation is applied to the whole voice information synthesized from the sub-voice information, the separated voice tends to be inaccurate. Therefore, when the sound source area is a critical area, a whole-cabin sound pickup mode can be adopted: the pieces of sub-voice information are synthesized into the target voice information, that is, the whole synthesized voice information is taken as the target voice information. This preserves the integrity of the voice information and prevents problems such as dropped words.
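The whole-cabin pickup mode can be sketched minimally as follows, assuming the sub-voice signals are available as time-aligned sample arrays (the synthesis rule, simple averaging, is an illustrative assumption):

```python
import numpy as np

def synthesize_whole_cabin(channels):
    """Whole-cabin pickup for a critical-area source: average the
    per-microphone signals into a single target signal instead of
    attempting separation, preserving every word at the cost of
    mixing in other cabin sounds. channels: (n_mics, n_samples)."""
    x = np.asarray(channels, dtype=float)
    return x.mean(axis=0)
```

A production system might instead use delay-and-sum beamforming aimed at the critical area, but the principle is the same: combine rather than separate.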
In some embodiments, step 140, obtaining the target voice information according to the sound source area and the plurality of pieces of sub-voice information, includes:
in sub-step 301, if the sound source area is a sound receiving area, synthesizing the plurality of pieces of sub-voice information into voice information to be separated;
in sub-step 302, separating the target voice information from the voice information to be separated by a blind source separation method.
When the sound source is located in a sound receiving area, the phase differences between the pieces of sub-voice information received by the microphones are large, which satisfies the conditions of the blind source separation technique. Separating the whole voice information synthesized from the sub-voice information with the blind source separation technique therefore yields separated voice information of higher accuracy.
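One common way (an assumption on our part; the application does not fix the algorithm) to realize the area decision is to estimate the inter-microphone delay from the peak of a cross-correlation and compare the lag with a threshold: a near-zero lag corresponds to a source sitting between the pickup zones, i.e. the critical area. A minimal two-microphone sketch, with an illustrative threshold:

```python
import numpy as np

def inter_mic_lag(x, y):
    """Estimate the delay of channel y relative to channel x, in samples.

    Uses the peak of the full cross-correlation; a positive result means
    y lags x, i.e. the source is closer to the microphone that captured x.
    """
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr)) - (len(x) - 1)

def classify_sound_source_area(x, y, lag_threshold):
    """Map the estimated inter-microphone delay to a sound source area.

    lag_threshold (in samples) is illustrative; in practice it would be
    derived from the microphone spacing and the sampling rate.
    """
    lag = inter_mic_lag(x, y)
    if abs(lag) <= lag_threshold:
        return "critical"  # source between the two pickup zones
    return "zone_x" if lag > 0 else "zone_y"
```

With a real microphone array, the pairwise lags of all microphone pairs would be combined; note that the sign convention of cross-correlation lags differs between libraries and must be checked against the geometry.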
For the details of separating voice with a blind source separation method, reference may be made to the related art; they are not repeated here.
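The application leaves the concrete blind source separation algorithm to the related art. As one common realization (our assumption, not taken from the application), a minimal FastICA-style separation of the microphone channels can be sketched: whiten the channels, then find an orthogonal unmixing rotation with the tanh nonlinearity:

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Minimal symmetric FastICA for a square mixing (n_mics == n_sources).

    X: array of shape (n_channels, n_samples), one microphone per row.
    Returns estimated sources of the same shape, recovered only up to
    permutation, sign, and scale (inherent ICA ambiguities).
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=1, keepdims=True)
    # Whitening: decorrelate the channels and scale them to unit variance.
    d, E = np.linalg.eigh(np.cov(X))
    Z = (E @ np.diag(d ** -0.5) @ E.T) @ X
    n = X.shape[0]
    W = np.random.default_rng(seed).standard_normal((n, n))
    for _ in range(n_iter):
        # FastICA fixed-point step with the tanh contrast function.
        G = np.tanh(W @ Z)
        W = (G @ Z.T) / Z.shape[1] - np.diag((1 - G ** 2).mean(axis=1)) @ W
        # Symmetric decorrelation keeps the unmixing rows orthonormal.
        U, _, Vt = np.linalg.svd(W)
        W = U @ Vt
    return W @ Z
```

On a toy mixture of a sine and a square wave this recovers both sources almost perfectly; in the in-vehicle setting the rows of `X` would be the sub-voice channels synthesized into the voice information to be separated.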
In some embodiments, after sub-step 121, the method further includes:
controlling a virtual object to face the sound source position, and playing the feedback voice corresponding to the voice information.
After the vehicle receives and processes the voice information, it often needs to provide feedback according to the content of the voice information. For example, when a passenger issues the voice command "go to nearby gas stations", the in-vehicle system selects several nearby gas stations according to the command and announces the selection to the passenger by voice. Further, a virtual object, for example an animated virtual character, can be displayed on the display screen of the in-vehicle system. When the vehicle plays the voice, the virtual object can be turned to face the sound source position, giving the passenger the experience of conversing with a person.
The application also provides a voice processing device, which comprises a processor and a plurality of microphones.
The plurality of microphones are arranged at intervals, each microphone corresponds to a sound receiving area, and the areas between the sound receiving areas are critical areas. The microphones are used for acquiring a plurality of pieces of sub-voice information, the plurality of pieces of sub-voice information being the voice information respectively acquired by the plurality of different microphones for the same voice.
The processor is electrically connected with the plurality of microphones and is configured to: determine the sound source area of the voice according to the plurality of pieces of sub-voice information, the sound source area being a sound receiving area or a critical area; obtain target voice information according to the sound source area and the plurality of pieces of sub-voice information; and obtain text information according to the target voice information.
According to the voice processing device provided by the embodiments of the application, a plurality of pieces of sub-voice information are acquired, the sound source area of the voice is determined from the plurality of pieces of sub-voice information, target voice information is obtained according to the sound source area and the plurality of pieces of sub-voice information, and text information is then obtained from the target voice information. In other words, the target voice information to be converted into text is generated according to the sound source area, so different sound source areas yield different target voice information. Because the influence of the sound source position on the processing result is fully taken into account, the device addresses the prior-art problem that voice information separated by the blind source separation technique is inaccurate when the phase differences between the pieces of sub-voice information are small.
The device in the embodiments of the application may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, tablet computer, notebook computer, palmtop computer, vehicle-mounted electronic device, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), and the non-mobile electronic device may be a server, network attached storage (NAS), personal computer (PC), television (TV), teller machine, or self-service machine; the embodiments of the application are not specifically limited in this respect.
The device in the embodiment of the application can be a device with an operating system. The operating system may be an Android operating system, an iOS operating system, or other possible operating systems, and the embodiment of the present application is not limited specifically.
The device provided by the embodiment of the application can realize each process realized by the embodiment of the method, and in order to avoid repetition, the description is omitted here.
Fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 5, an embodiment of the present application further provides an electronic device M00, which includes a processor M01, a memory M02, and a program or instruction stored in the memory M02 and executable on the processor M01. When executed by the processor M01, the program or instruction implements each process of the above method embodiment and achieves the same technical effect; to avoid repetition, the description is not repeated here.
The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.
Fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1000 includes, but is not limited to: radio frequency unit 1001, network module 1002, audio output unit 1003, input unit 1004, sensor 1005, display unit 1006, user input unit 1007, interface unit 1008, memory 1009, and processor 1010.
Those skilled in the art will appreciate that the electronic device 1000 may also include a power source (e.g., a battery) for powering the various components, and the power source may be logically connected to the processor 1010 through a power management system so as to manage charging, discharging, power consumption, and other functions through the power management system. The electronic device structure shown in fig. 6 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or arrange the components differently, which is not described in detail here.
It should be appreciated that in an embodiment of the present application, the input unit 1004 may include a graphics processor (Graphics Processing Unit, GPU) 10041 and a microphone 10042, and the graphics processor 10041 processes image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The display unit 1006 may include a display panel 10061, and the display panel 10061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1007 includes a touch panel 10071 and other input devices 10072. The touch panel 10071 is also referred to as a touch screen. The touch panel 10071 can include two portions, a touch detection device and a touch controller. Other input devices 10072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein. Memory 1009 may be used to store software programs as well as various data including, but not limited to, application programs and an operating system. The processor 1010 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1010.
The embodiment of the application also provides a readable storage medium, on which a program or an instruction is stored, which when executed by a processor, implements each process of the above method embodiment, and can achieve the same technical effects, and in order to avoid repetition, the description is omitted here.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium such as a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.
The embodiment of the application further provides a chip, which comprises a processor and a communication interface, wherein the communication interface is coupled with the processor, and the processor is used for running programs or instructions to realize the processes of the embodiment of the method, and can achieve the same technical effects, so that repetition is avoided, and the description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, chip systems, or system-on-chip chips, etc.
Finally, it should be noted that the described embodiments are some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In this document, relational terms such as first and second, and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal device comprising that element.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the application.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (12)

1. A voice information processing method, comprising:
acquiring a plurality of sub-voice information; the plurality of sub-voice information is voice information respectively acquired by a plurality of different microphones aiming at the same voice, the plurality of different microphones are arranged at intervals, each microphone corresponds to a sound receiving area, and areas among the plurality of sound receiving areas are critical areas;
determining a sound source area of the voice according to the plurality of sub-voice information; the sound source area is the sound receiving area or the critical area;
obtaining target voice information according to the sound source area and the plurality of sub-voice information;
and obtaining text information according to the target voice information.
2. The method of claim 1, wherein the obtaining target voice information according to the sound source area and the plurality of sub-voice information comprises:
and if the sound source area is the critical area, synthesizing the plurality of sub-voice information into the target voice information.
3. The method of claim 1, wherein the obtaining target voice information according to the sound source area and the plurality of sub-voice information comprises:
if the sound source area is the sound receiving area, synthesizing the plurality of sub-voice information into voice information to be separated;
and separating the target voice information from the voice information to be separated by using a blind source separation method.
4. The method of claim 1, wherein the determining a sound source area of the voice according to the plurality of sub-voice information comprises:
determining a sound source position of the voice according to the plurality of sub-voice information;
and determining the sound source area according to the sound source position.
5. The method of claim 4, wherein the sub-voice information includes time information characterizing the time at which the microphone acquired the sub-voice information, and the determining the sound source position of the voice according to the plurality of sub-voice information comprises:
and determining the sound source position of the voice according to the time difference of the plurality of time information.
6. The method of claim 4, wherein the sub-voice information includes sound intensity information characterizing the sound intensity of the sub-voice information, and the determining the sound source position of the voice according to the plurality of sub-voice information comprises:
and determining the sound source position of the voice according to the sound intensity differences among the plurality of pieces of sound intensity information.
7. The method of claim 4, further comprising, after the determining a sound source position of the voice according to the plurality of sub-voice information and before the determining the sound source area according to the sound source position:
acquiring the confidence coefficient of the sound source position; the confidence characterizes the accuracy of the sound source position;
if the confidence coefficient is smaller than a preset threshold value, acquiring video information;
identifying lip movement information in the video information;
the determining the sound source area according to the sound source position includes:
and determining the sound source area according to the lip movement information and the sound source position.
8. The method of claim 4, further comprising, after the determining a sound source position of the voice according to the plurality of sub-voice information:
and controlling the virtual object to face the sound source position, and playing the feedback voice corresponding to the voice information.
9. The method according to any one of claims 1-8, wherein the method is applied to a vehicle, a plurality of seating areas are arranged in the vehicle, a microphone is arranged corresponding to each seating area, the sound receiving area of each microphone is the seating area corresponding to that microphone, and the critical area is the area between the plurality of seating areas.
10. A voice information processing apparatus, comprising:
a plurality of microphones arranged at intervals, each microphone corresponding to a sound receiving area, the areas between the sound receiving areas being critical areas, the microphones being configured to acquire a plurality of pieces of sub-voice information, the plurality of pieces of sub-voice information being the voice information respectively acquired by the plurality of different microphones for the same voice; and
a processor electrically connected with the plurality of microphones for determining a sound source area of the voice according to the plurality of sub-voice information; the sound source area is the sound receiving area or the critical area; obtaining target voice information according to the sound source area and the plurality of sub-voice information; and obtaining text information according to the target voice information.
11. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, which when executed by the processor, implements the method of any of claims 1-9.
12. A readable storage medium storing a program or instructions which when executed by a processor performs the method of any one of claims 1-9.
CN202210183336.0A 2022-02-24 2022-02-24 Voice information processing method and device, electronic equipment and readable storage medium Pending CN116705027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210183336.0A CN116705027A (en) 2022-02-24 2022-02-24 Voice information processing method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116705027A true CN116705027A (en) 2023-09-05

Family

ID=87828084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210183336.0A Pending CN116705027A (en) 2022-02-24 2022-02-24 Voice information processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116705027A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination