CN111145753A - Voice processing method, device and system - Google Patents

Voice processing method, device and system

Info

Publication number
CN111145753A
CN111145753A (application CN201811302321.1A)
Authority
CN
China
Prior art keywords
voice data
recognized
sound source
data
positioning result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811302321.1A
Other languages
Chinese (zh)
Inventor
杨茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811302321.1A
Publication of CN111145753A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the present application provide a voice processing method, device and system, wherein the method includes: performing sound source localization on voice data, and marking the text data converted from the voice data based on the localization result. That is, the content spoken by different people is distinguished by the positions of those people, and the amount of computation involved is small.

Description

Voice processing method, device and system
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, and a system for processing speech.
Background
In various scenarios such as conferences and classes, the voice data uttered by participants can be converted into text data by a voice processing scheme to obtain a conference record or a classroom record. Compared with manual note-taking, this saves labor and improves the accuracy of the record.
However, it is often difficult for a machine to distinguish the content spoken by different people.
Disclosure of Invention
An object of the embodiments of the present application is to provide a voice processing method, apparatus, and system, so as to solve the problem that the content spoken by different people is difficult to distinguish when converting speech into text.
In order to achieve the above object, an embodiment of the present application provides a speech processing method, including:
acquiring voice data to be recognized;
carrying out sound source positioning on the voice data to be recognized to obtain a positioning result;
converting the voice data to be recognized into character data;
and marking the character data based on the positioning result.
Optionally, the acquiring the voice data to be recognized includes: acquiring voice data acquired by a microphone array as voice data to be recognized;
the performing sound source positioning on the voice data to be recognized to obtain a positioning result includes:
and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
Optionally, the converting the voice data to be recognized into text data includes:
converting the voice data collected by any one or more microphones in the microphone array into text data;
or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
Optionally, the sound source positioning is performed on the voice data to be recognized to obtain a positioning result, including:
determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking the text data based on the positioning result comprises:
and marking the character data by taking the angle as a label.
Optionally, the sound source positioning is performed on the voice data to be recognized to obtain a positioning result, including:
determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle;
searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking the text data based on the positioning result comprises:
and marking the character data by taking the searched seat identification as a label.
Optionally, after the marking the text data based on the positioning result, the method further includes:
and correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
In order to achieve the above object, an embodiment of the present application further provides a speech processing apparatus, including:
the acquisition module is used for acquiring voice data to be recognized;
the positioning module is used for positioning a sound source of the voice data to be recognized to obtain a positioning result;
the conversion module is used for converting the voice data to be recognized into character data;
and the marking module is used for marking the character data based on the positioning result.
Optionally, the obtaining module is specifically configured to: acquiring voice data acquired by a microphone array as voice data to be recognized;
the positioning module is specifically configured to: and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
Optionally, the conversion module is specifically configured to:
converting the voice data collected by any one or more microphones in the microphone array into text data; or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
Optionally, the positioning module is specifically configured to: determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking module is specifically configured to: and marking the character data by taking the angle as a label.
Optionally, the positioning module is specifically configured to: determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle; searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking module is specifically configured to: and marking the character data by taking the searched seat identification as a label.
Optionally, the apparatus further comprises:
and the storage module is used for correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
In order to achieve the above object, an embodiment of the present application further provides an electronic device, including a processor and a memory;
a memory for storing a computer program;
and a processor for implementing any of the above-described speech processing methods when executing the program stored in the memory.
In order to achieve the above object, an embodiment of the present application further provides a speech processing system, including: a sound collection device and a voice processing device; wherein:
the voice acquisition equipment is used for acquiring voice data and sending the voice data to the voice processing equipment;
the voice processing device is used for receiving the voice data as voice data to be recognized; carrying out sound source positioning on the voice data to be recognized to obtain a positioning result; converting the voice data to be recognized into character data; and marking the character data based on the positioning result.
The embodiments of the present application perform sound source localization on voice data and mark the text data converted from the voice data based on the localization result; that is, the content (text data) spoken by different people is distinguished according to the positions of those people, and the amount of computation involved is small.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a first flowchart of a speech processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a first scenario provided in the embodiment of the present application;
fig. 3 is a schematic diagram of a second scenario provided in the embodiment of the present application;
fig. 4a is a schematic diagram of a third scenario provided in the embodiment of the present application;
fig. 4b is a schematic diagram of a fourth scenario provided in the embodiment of the present application;
fig. 5 is a second flowchart of a speech processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a speech processing system according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the foregoing technical problem, embodiments of the present application provide a voice processing method, apparatus, and system. The method and apparatus may be applied to a voice processing device, or to a sound collection device; this is not specifically limited.
First, a speech processing method provided in an embodiment of the present application is described in detail below. Fig. 1 is a first flowchart of a speech processing method according to an embodiment of the present application, including:
s101: and acquiring voice data to be recognized.
The embodiment of the application can be applied to various scenes such as conferences, classes and the like. Taking a conference scene as an example, a sound collection device may be disposed in a conference room to collect voice data of conference participants. In one case, the scheme can be executed while the conference is in progress, and voice data is collected and recognized in real time. In another case, only voice data may be recorded while the conference is in progress, and after the conference is ended, the recorded voice data is identified by executing the scheme.
For another example, in a classroom scenario, a sound collection device may be provided in a classroom to collect the voice data of teachers and students. In one case, the scheme can be executed during class, with voice data collected and recognized in real time. In another case, voice data may only be recorded during class, and after class ends the recorded voice data is recognized by executing the scheme.
S102: and carrying out sound source positioning on the voice data to be recognized to obtain a positioning result.
As an embodiment, the sound collecting device disposed in the scene may be a microphone array, and thus, S101 includes: and acquiring voice data acquired by the microphone array as voice data to be recognized. In this case, the positioning result of the voice data to be recognized may be obtained by comparing the voice data collected by each microphone in the microphone array.
The number of microphones in the microphone array is not limited specifically, and may be, for example, 4, 6, or 8, etc. The array shape of the microphone array is not limited, for example, the microphone array may be a linear array, a circular array, a distributed array, or the like.
There are various ways to perform sound source localization. For example, a DOA (direction of arrival) estimation algorithm may be adopted; or the sound source may be localized according to the differences between the times at which the sound it emits reaches the different microphones (time difference of arrival, TDOA).
Alternatively, sound source localization may be performed in other manners, such as localization based on high-resolution spectra or on steerable beams; these are not listed one by one.
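As an illustration of the TDOA idea above, the following minimal sketch estimates a direction of arrival from the delay between one pair of microphones, measured with GCC-PHAT cross-correlation. It assumes a far-field source and a known microphone spacing; the function names and the choice of GCC-PHAT are illustrative, not prescribed by this application.

import numpy as np

SPEED_OF_SOUND = 343.0  # approximate speed of sound in air, m/s

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    r = SIG * np.conj(REF)
    cc = np.fft.irfft(r / (np.abs(r) + 1e-12), n=n)  # phase-transform weighting
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

def doa_angle(mic_a: np.ndarray, mic_b: np.ndarray, fs: int, spacing_m: float) -> float:
    """Angle of the source relative to the axis of a two-microphone pair,
    in degrees (90 = broadside, 0/180 = endfire)."""
    tau = gcc_phat(mic_a, mic_b, fs)
    # Far-field model: tau = spacing * cos(theta) / c; clamp for numerical safety.
    cos_theta = np.clip(tau * SPEED_OF_SOUND / spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

For a full array, such pairwise angle estimates (or a grid search over candidate directions) would typically be combined to obtain the localization result.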
In one case, S102 may include: and determining the angle of the sound source of the voice data to be recognized relative to the sound acquisition equipment, and taking the angle as the sound source positioning result of the voice data to be recognized.
For example, assume the sound collection device is a microphone array shaped as a linear array, as shown in fig. 2. A line l1 connecting the sound source to the microphone 3 at the center of the linear array can be constructed, and the angle θ1 between l1 and the straight line l2 on which the linear array lies can be used to represent the localization result of the sound source. Alternatively, the line l3 connecting the sound source to the microphone 5 farthest from the sound source can be constructed, and the angle θ2 between l3 and l2 used to represent the localization result. Alternatively, the line l4 connecting the sound source to the microphone 1 closest to the sound source can be constructed, and the angle θ3 between l4 and l2 used to represent the localization result. The localization result can also be the angle between l2 and the line connecting the sound source to any other microphone; these are not listed one by one.
Alternatively, a perpendicular l' to the straight line l2 on which the linear array lies can be constructed, and the angle between l' and the line connecting the sound source to the array used to represent the localization result of the sound source. The angles included in the localization result are only used to represent the position of the sound source relative to the sound collection device, and are not listed one by one.
For another example, assume the sound collection device is a microphone array with a circular shape, as shown in fig. 3. A line l5 passing through the sound source and the center of the circle can be constructed, and the angle θ4 between l5 and a diameter l6 of the circle used to represent the localization result of the sound source.
As another example, the sound collection device may also be a microphone array in the shape of a distributed array. As shown in fig. 4b, the microphone array includes two groups of microphones arranged along two directions: microphones 1-4 arranged along direction 1 and microphones 5-8 arranged along direction 2. The localization result of the sound source can therefore be represented by two angles, one per direction, which makes the localization more accurate. Alternatively, the distributed array may include microphone groups arranged along more directions; these are not listed one by one.
In addition, in some cases the above sound source localization methods can determine not only the angle of the sound source relative to the sound collection device but also its distance, so that the sound source is localized more accurately.
S103: and converting the voice data to be recognized into character data.
In this embodiment, the execution sequence of S102 and S103 is not limited, and S102 may be executed first and then S103 may be executed, or S103 may be executed first and then S102 may be executed, or S102 and S103 may be executed simultaneously.
As described above, the sound collection device disposed in the scene may be a microphone array, in which case, in one embodiment, the voice data collected by any one or more microphones in the microphone array may be converted into text data; in another embodiment, the voice data acquired by the microphone array may be subjected to beam forming according to the positioning result to obtain enhanced voice data, and the enhanced voice data may be converted into text data.
Beamforming means performing weighted synthesis on the voice data received by the multiple microphones in the microphone array, which is equivalent to forming a beam in a specified direction, i.e., enhancing the voice data from that direction. For example, in the example of fig. 3, let l5 be the line connecting the sound source and the center of the circle; the voice data in the direction of l5 is then enhanced, and the voice data in other directions is suppressed. The beamformed voice data is referred to as enhanced voice data. Performing speech recognition on the enhanced voice data (i.e., converting it into text data) improves the recognition accuracy compared to performing speech recognition on voice data that has not been beamformed.
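The weighted synthesis described above can be realized in many ways; a common instance is delay-and-sum beamforming, sketched below in the frequency domain. The steering delays would come from the localization result and the array geometry (for a linear array, for example, delay_m = position_m * cos(theta) / c); the uniform averaging weights and all names here are illustrative assumptions, not the specific weighting of this application.

import numpy as np

def delay_and_sum(channels: np.ndarray, delays_s: np.ndarray, fs: int) -> np.ndarray:
    """channels: (n_mics, n_samples) multichannel voice data;
    delays_s: per-microphone steering delays (seconds) toward the source.
    Returns the enhanced single-channel signal."""
    n_mics, n_samples = channels.shape
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(freqs.size, dtype=complex)
    for m in range(n_mics):
        spec = np.fft.rfft(channels[m])
        # A fractional-sample delay is a linear phase shift in the frequency domain.
        out += spec * np.exp(-2j * np.pi * freqs * delays_s[m])
    return np.fft.irfft(out / n_mics, n=n_samples)

Signals arriving from the steered direction add coherently while signals from other directions partially cancel, which is the enhancement/suppression behavior described above.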
S104: and marking the character data based on the positioning result.
In the above-described embodiment, the positioning result is the angle of the sound source with respect to the sound collection device, and in this case, the character data may be directly marked with the angle as a label.
Or, as another embodiment, an angle of a sound source of the voice data to be recognized with respect to the sound collection device may be determined as a sound source angle; searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized; in this case, S104 includes: and marking the character data by taking the searched seat identification as a label.
In scenarios such as conferences and classrooms, the seats are generally fixed, so a mapping relationship between seat identifiers and sound source angles can be established in advance. For example, as shown in fig. 4a, each seat represents a sound source position, and the sound collection device is a circular microphone array. The sound source is connected to the center of the circle, and the angle between this connecting line and a diameter l7 of the circle represents the angle of each sound source. Suppose the sound source angle corresponding to seat 1 is α1; a mapping between seat 1 and angle α1 is established. Similarly, mappings are established between seat 2 and angle α2, between seat 3 and angle α3, and between seat 4 and angle α4.
Assume that voice data A to be recognized is acquired, and the angle of its sound source with respect to the microphone array is determined to be α2; then, according to the established mapping relationship, the localization result of the sound source can be determined to be seat 2. The text data converted from voice data A can thus be marked with seat 2 as its label.
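The lookup in the pre-established mapping can be sketched as a nearest-angle match within a tolerance, as below; the seat table, tolerance value, and function name are made-up examples for illustration only.

from typing import Optional

SEAT_ANGLES = {  # seat identifier -> pre-registered sound source angle (degrees)
    "seat 1": 30.0,
    "seat 2": 80.0,
    "seat 3": 130.0,
    "seat 4": 180.0,
}

def lookup_seat(source_angle_deg: float, tolerance_deg: float = 15.0) -> Optional[str]:
    """Return the seat whose registered angle is closest to the measured angle,
    or None if no seat lies within the tolerance."""
    seat, angle = min(SEAT_ANGLES.items(), key=lambda kv: abs(kv[1] - source_angle_deg))
    return seat if abs(angle - source_angle_deg) <= tolerance_deg else None

For the distributed array of fig. 4b, the same lookup would simply match a pair of angles instead of a single one.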
As another example, the microphone array in fig. 4b is in the shape of a distributed array, which includes two groups of microphones arranged along two directions: microphones 1-4 arranged along direction 1 and microphones 5-8 arranged along direction 2. In this case, the sound source localization result can be represented by two angles, one per direction.
For example, take seat 6 as the sound source. A line (line 1) can be drawn between the sound source and the center of microphones 1-4 along direction 1, and the angle between line 1 and direction 1 is angle 1. Likewise, a line (line 2) can be drawn between the sound source and the center of microphones 5-8 along direction 2, and the angle between line 2 and direction 2 is angle 2. A mapping between (angle 1, angle 2) and seat 6 is then established.
The other seats are handled similarly and are not listed one by one. In fig. 4b, two angles are used to represent the sound source localization result, which makes the localization more accurate.
By applying this embodiment, voice data from different sound sources are marked with different labels, and voice data from different sound sources are the voice data uttered by different people. In the first aspect, the scheme thus distinguishes the voice data of different people. In the second aspect, it distinguishes them through sound source localization, and because the data volume of a sound source localization result is small, the overall amount of computation is reduced. In the third aspect, there is no need to manually collect each person's voice data separately, which saves manpower and improves flexibility.
Or, in some scenarios, the corresponding relationship between the seat and the person is also fixed, and in this case, a mapping relationship between the seat identifier and the person identity may be established, so that the person identity may be used as a sound source positioning result, and the person identity is used as a tag to mark the text data.
As an embodiment, after S104, the method may further include: and correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
If the scheme is applied to a conference scenario, the content record is a conference record; if applied to a classroom scenario, it is a classroom record. In addition, the content record may further include the time corresponding to the voice data. For example, the content record may be as shown in table 1:
TABLE 1
[Table 1 is presented as an image in the original publication and is not reproduced here.]
Table 1 is merely an example and does not limit the present invention.
For example, assume the person corresponding to seat 1 is a key participant in a conference, and one later wants to view the content spoken by that person separately; in this case, the text content labeled seat 1 can be selected for viewing according to the label.
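The corresponding storage of voice data and marked text data can be sketched as a simple record store; the field names (time, label, text, audio path) and the CSV output are assumptions made for illustration, echoing the example of table 1.

import csv
from dataclasses import dataclass

@dataclass
class RecordEntry:
    timestamp: str   # e.g. "2018-11-02 09:30:15"
    label: str       # seat identifier, angle, or person identity used as the tag
    text: str        # recognized text data
    audio_path: str  # path to the stored segment of voice data to be recognized

def save_content_record(entries: list[RecordEntry], path: str) -> None:
    """Write the marked entries out as a CSV content record."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "label", "text", "audio"])
        for e in entries:
            writer.writerow([e.timestamp, e.label, e.text, e.audio_path])

Keeping the audio path alongside the text allows a viewer to jump from any line of the record back to the original voice data.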
Fig. 5 is a second flowchart of the speech processing method according to the embodiment of the present application, including:
s501: and acquiring voice data acquired by the microphone array as voice data to be recognized.
The number of microphones in the microphone array is not limited specifically, and may be, for example, 4, 6, or 8, etc. The array shape of the microphone array is not limited, for example, the microphone array may be a linear array, a circular array, a distributed array, or the like.
S502: the method comprises the steps of comparing voice data collected by each microphone in a microphone array, and determining the angle of a sound source of the voice data to be recognized relative to the microphone array as a sound source angle.
S503: and searching a seat identifier corresponding to the sound source angle in a mapping relation between the seat identifier and the angle which is established in advance, and taking the seat identifier as a sound source positioning result of the voice data to be recognized.
In scenarios such as conferences and classrooms, the seats are generally fixed, so a mapping relationship between seat identifiers and sound source angles can be established in advance. For example, as shown in fig. 4a, each seat represents a sound source position, and the sound collection device is a circular microphone array. The sound source is connected to the center of the circle, and the angle between this connecting line and a diameter l7 of the circle represents the angle of each sound source. Suppose the sound source angle corresponding to seat 1 is α1; a mapping between seat 1 and angle α1 is established. Similarly, mappings are established between seat 2 and angle α2, between seat 3 and angle α3, and between seat 4 and angle α4.
Assume that voice data A to be recognized is acquired, and the angle of its sound source with respect to the microphone array is determined to be α2; then, according to the established mapping relationship, the localization result of the sound source can be determined to be seat 2.
S504: and according to the positioning result, carrying out beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into character data.
Beamforming means performing weighted synthesis on the voice data received by the multiple microphones in the microphone array, which is equivalent to forming a beam in a specified direction, i.e., enhancing the voice data from that direction. For example, in the example of fig. 3, let l5 be the line connecting the sound source and the center of the circle; the voice data in the direction of l5 is then enhanced, and the voice data in other directions is suppressed. The beamformed voice data is referred to as enhanced voice data. Performing speech recognition on the enhanced voice data (i.e., converting it into text data) improves the recognition accuracy compared to performing speech recognition on voice data that has not been beamformed.
S505: and marking the character data by taking the positioning result as a label.
S506: and correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
If the scheme is applied to a conference scenario, the content record is a conference record; if applied to a classroom scenario, it is a classroom record.
By applying the embodiment shown in fig. 5, in the first aspect, the voice data can be associated with the person information through sound source localization alone, so the amount of computation involved is small. In the second aspect, the voice data is enhanced by beamforming before being converted into text data, which improves the conversion quality. In the third aspect, marking the text data with the seat identifier as a label intuitively shows which content corresponds to which person. In the fourth aspect, the voice data and the marked text data are stored correspondingly, so the resulting content record is more complete.
Corresponding to the foregoing method embodiment, an embodiment of the present application further provides a speech processing apparatus, as shown in fig. 6, including:
an obtaining module 601, configured to obtain voice data to be recognized;
a positioning module 602, configured to perform sound source positioning on the voice data to be recognized to obtain a positioning result;
a conversion module 603, configured to convert the voice data to be recognized into text data;
a marking module 604, configured to mark the text data based on the positioning result.
As an embodiment, the obtaining module 601 may be specifically configured to: acquiring voice data acquired by a microphone array as voice data to be recognized;
the positioning module 602 may be specifically configured to: and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
As an implementation manner, the conversion module 603 is specifically configured to:
converting the voice data collected by any one or more microphones in the microphone array into text data; or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
As an embodiment, the positioning module 602 is specifically configured to: determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking module 604 is specifically configured to: and marking the character data by taking the angle as a label.
As an embodiment, the positioning module 602 is specifically configured to: determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle; searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking module 604 is specifically configured to: and marking the character data by taking the searched seat identification as a label.
As an embodiment, the apparatus further comprises:
and a storage module (not shown in the figure) for correspondingly storing the voice data to be recognized and the marked text data to obtain a content record.
Applying the embodiment shown in fig. 6, sound source localization is performed on voice data, and the text data converted from the voice data is marked based on the localization result; that is, the content (text data) spoken by different people is distinguished according to the positions of those people.
Embodiments of the present application also provide an electronic device, as shown in fig. 7, including a processor 701 and a memory 702, wherein:
the memory 702 is configured to store a computer program;
the processor 701 is configured to implement any of the above-described speech processing methods when executing the program stored in the memory 702.
The memory mentioned in the above electronic device may include a random access memory (RAM) or a non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements any one of the above-mentioned speech processing methods.
An embodiment of the present application further provides a speech processing system, as shown in fig. 8, including: a sound collection device and a voice processing device; wherein:
the voice acquisition equipment is used for acquiring voice data and sending the voice data to the voice processing equipment;
the voice processing device is used for receiving the voice data as voice data to be recognized; carrying out sound source positioning on the voice data to be recognized to obtain a positioning result; converting the voice data to be recognized into character data; and marking the character data based on the positioning result.
The speech processing apparatus may perform any of the speech processing methods described above.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, the electronic device embodiment, the computer-readable storage medium embodiment and the system embodiment, since they are substantially similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (13)

1. A method of speech processing, comprising:
acquiring voice data to be recognized;
carrying out sound source positioning on the voice data to be recognized to obtain a positioning result;
converting the voice data to be recognized into character data;
and marking the character data based on the positioning result.
2. The method of claim 1, wherein the obtaining voice data to be recognized comprises: acquiring voice data acquired by a microphone array as voice data to be recognized;
the performing sound source positioning on the voice data to be recognized to obtain a positioning result comprises:
and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
3. The method of claim 2, wherein converting the speech data to be recognized into text data comprises:
converting the voice data collected by any one or more microphones in the microphone array into text data;
or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
4. The method according to claim 1, wherein the performing sound source localization on the speech data to be recognized to obtain a localization result comprises:
determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking the text data based on the positioning result comprises:
and marking the character data by taking the angle as a label.
5. The method according to claim 1, wherein the performing sound source localization on the speech data to be recognized to obtain a localization result comprises:
determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle;
searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking the text data based on the positioning result comprises:
and marking the character data by taking the searched seat identification as a label.
6. The method of claim 1, further comprising, after said marking the text data based on the positioning result:
and correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
7. A speech processing apparatus, comprising:
the acquisition module is used for acquiring voice data to be recognized;
the positioning module is used for positioning a sound source of the voice data to be recognized to obtain a positioning result;
the conversion module is used for converting the voice data to be recognized into character data;
and the marking module is used for marking the character data based on the positioning result.
8. The apparatus of claim 7, wherein the obtaining module is specifically configured to: acquiring voice data acquired by a microphone array as voice data to be recognized;
the positioning module is specifically configured to: and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
9. The apparatus of claim 8, wherein the conversion module is specifically configured to:
converting the voice data collected by any one or more microphones in the microphone array into text data; or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
10. The apparatus according to claim 7, wherein the positioning module is specifically configured to: determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking module is specifically configured to: and marking the character data by taking the angle as a label.
11. The apparatus according to claim 7, wherein the positioning module is specifically configured to: determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle; searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking module is specifically configured to: and marking the character data by taking the searched seat identification as a label.
12. The apparatus of claim 7, further comprising:
and the storage module is used for correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
13. A speech processing system, comprising: a sound collection device and a voice processing device; wherein:
the voice acquisition equipment is used for acquiring voice data and sending the voice data to the voice processing equipment;
the voice processing device is used for receiving the voice data as voice data to be recognized; carrying out sound source positioning on the voice data to be recognized to obtain a positioning result; converting the voice data to be recognized into character data; and marking the character data based on the positioning result.
CN201811302321.1A 2018-11-02 2018-11-02 Voice processing method, device and system Pending CN111145753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811302321.1A CN111145753A (en) 2018-11-02 2018-11-02 Voice processing method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811302321.1A CN111145753A (en) 2018-11-02 2018-11-02 Voice processing method, device and system

Publications (1)

Publication Number Publication Date
CN111145753A true CN111145753A (en) 2020-05-12

Family

ID=70515468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811302321.1A Pending CN111145753A (en) 2018-11-02 2018-11-02 Voice processing method, device and system

Country Status (1)

Country Link
CN (1) CN111145753A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104898091A (en) * 2015-05-29 2015-09-09 复旦大学 Microphone array self-calibration sound source positioning system based on iterative optimization algorithm
CN107124647A (en) * 2017-05-27 2017-09-01 深圳市酷开网络科技有限公司 A kind of panoramic video automatically generates the method and device of subtitle file when recording
CN107799118A (en) * 2016-09-05 2018-03-13 深圳光启合众科技有限公司 Voice directions recognition methods and apparatus and system, home controller
CN108629024A (en) * 2018-05-09 2018-10-09 王泽普 A kind of teaching Work attendance method based on voice recognition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104898091A (en) * 2015-05-29 2015-09-09 复旦大学 Microphone array self-calibration sound source positioning system based on iterative optimization algorithm
CN107799118A (en) * 2016-09-05 2018-03-13 深圳光启合众科技有限公司 Voice directions recognition methods and apparatus and system, home controller
CN107124647A (en) * 2017-05-27 2017-09-01 深圳市酷开网络科技有限公司 A kind of panoramic video automatically generates the method and device of subtitle file when recording
CN108629024A (en) * 2018-05-09 2018-10-09 王泽普 A kind of teaching Work attendance method based on voice recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈小平 (Chen Xiaoping): "无线传感器网络" [Wireless Sensor Networks], 30 April 2017 *

Similar Documents

Publication Publication Date Title
Vera-Diaz et al. Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates
CN106657865B (en) Conference summary generation method and device and video conference system
Brandstein A framework for speech source localization using sensor arrays
CN102630385B (en) Method, device and system for audio zooming process within an audio scene
CN102763432B (en) Processing of multi-device audio capture
CN101567969B (en) Intelligent video director method based on microphone array sound guidance
WO2018095166A1 (en) Device control method, apparatus and system
CN109308892B (en) Voice synthesis broadcasting method, device, equipment and computer readable medium
CN104246878A (en) Audio user interaction recognition and context refinement
CN103902963A (en) Method and electronic equipment for recognizing orientation and identification
CN109191442B (en) Ultrasonic image evaluation and screening method and device
CN110443371A (en) A kind of artificial intelligence device and method
Gabriel et al. 2D sound source position estimation using microphone arrays and its application to a VR-based bird song analysis system
CN111223107A (en) Point cloud data set manufacturing system and method based on point cloud deep learning
CN113314138B (en) Sound source monitoring and separating method and device based on microphone array and storage medium
CN111145753A (en) Voice processing method, device and system
CN106056503A (en) Intelligent music teaching platform and application method thereof
CN110175260B (en) Method and device for distinguishing recording roles and computer-readable storage medium
CN113643708B (en) Method and device for identifying ginseng voiceprint, electronic equipment and storage medium
KR20190016683A (en) Apparatus for automatic conference notetaking using mems microphone array
CN113611308B (en) Voice recognition method, device, system, server and storage medium
JP2019103011A (en) Converter, conversion method, and program
WO2023010599A1 (en) Target trajectory calibration method based on video and audio, and computer device
CN111492668B (en) Method and system for locating the origin of an audio signal within a defined space
CN113115009A (en) Image transmission method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200512