CN111145753A - Voice processing method, device and system - Google Patents

Voice processing method, device and system

Info

Publication number
CN111145753A
CN111145753A (application CN201811302321.1A)
Authority
CN
China
Prior art keywords
voice data
recognized
sound source
data
positioning result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811302321.1A
Other languages
Chinese (zh)
Inventor
杨茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811302321.1A
Publication of CN111145753A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161: Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166: Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the present application provide a voice processing method, device and system, wherein the method includes: performing sound source localization on voice data, and marking the text data converted from the voice data based on the localization result. That is, the content spoken by different people is distinguished by the positions of those people, and the amount of computation involved is small.

Description

Voice processing method, device and system
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, and a system for processing speech.
Background
In various scenarios such as conferences and classes, the voice data uttered by participants can be converted into text data by a voice processing scheme to obtain a conference record or a classroom record. Compared with manual note-taking, this saves labor and improves the accuracy of the record.
However, it is often difficult for a machine to distinguish the content spoken by different people.
Disclosure of Invention
An object of the embodiments of the present application is to provide a voice processing method, apparatus, and system, so as to solve the problem that the content spoken by different people is difficult to distinguish when converting speech into text.
In order to achieve the above object, an embodiment of the present application provides a speech processing method, including:
acquiring voice data to be recognized;
carrying out sound source positioning on the voice data to be recognized to obtain a positioning result;
converting the voice data to be recognized into character data;
and marking the character data based on the positioning result.
Optionally, the acquiring the voice data to be recognized includes: acquiring voice data acquired by a microphone array as voice data to be recognized;
the performing sound source positioning on the voice data to be recognized to obtain a positioning result includes:
and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
Optionally, the converting the voice data to be recognized into text data includes:
converting the voice data collected by any one or more microphones in the microphone array into text data;
or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
Optionally, the sound source positioning is performed on the voice data to be recognized to obtain a positioning result, including:
determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking the text data based on the positioning result comprises:
and marking the character data by taking the angle as a label.
Optionally, the sound source positioning is performed on the voice data to be recognized to obtain a positioning result, including:
determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle;
searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking the text data based on the positioning result comprises:
and marking the character data by taking the searched seat identification as a label.
Optionally, after the marking the text data based on the positioning result, the method further includes:
and correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
In order to achieve the above object, an embodiment of the present application further provides a speech processing apparatus, including:
the acquisition module is used for acquiring voice data to be recognized;
the positioning module is used for positioning a sound source of the voice data to be recognized to obtain a positioning result;
the conversion module is used for converting the voice data to be recognized into character data;
and the marking module is used for marking the character data based on the positioning result.
Optionally, the obtaining module is specifically configured to: acquiring voice data acquired by a microphone array as voice data to be recognized;
the positioning module is specifically configured to: and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
Optionally, the conversion module is specifically configured to:
converting the voice data collected by any one or more microphones in the microphone array into text data; or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
Optionally, the positioning module is specifically configured to: determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking module is specifically configured to: and marking the character data by taking the angle as a label.
Optionally, the positioning module is specifically configured to: determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle; searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking module is specifically configured to: and marking the character data by taking the searched seat identification as a label.
Optionally, the apparatus further comprises:
and the storage module is used for correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
In order to achieve the above object, an embodiment of the present application further provides an electronic device, including a processor and a memory;
a memory for storing a computer program;
and a processor for implementing any of the above-described speech processing methods when executing the program stored in the memory.
In order to achieve the above object, an embodiment of the present application further provides a speech processing system, including: a sound collection device and a voice processing device; wherein:
the voice acquisition equipment is used for acquiring voice data and sending the voice data to the voice processing equipment;
the voice processing device is used for receiving the voice data as voice data to be recognized; carrying out sound source positioning on the voice data to be recognized to obtain a positioning result; converting the voice data to be recognized into character data; and marking the character data based on the positioning result.
The embodiments of the present application perform sound source localization on voice data and mark the text data converted from the voice data based on the localization result; that is, the content (text data) spoken by different people is distinguished according to the positions of those people, and the amount of computation involved is small.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a first flowchart of a speech processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a first scenario provided in the embodiment of the present application;
fig. 3 is a schematic diagram of a second scenario provided in the embodiment of the present application;
fig. 4a is a schematic diagram of a third scenario provided in the embodiment of the present application;
fig. 4b is a schematic diagram of a fourth scenario provided in the embodiment of the present application;
fig. 5 is a second flowchart of a speech processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a speech processing system according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the foregoing technical problem, embodiments of the present application provide a voice processing method, apparatus, and system. The method and apparatus may be applied to a voice processing device, or to a sound collection device; this is not specifically limited.
First, a speech processing method provided in an embodiment of the present application is described in detail below. Fig. 1 is a first flowchart of a speech processing method according to an embodiment of the present application, including:
s101: and acquiring voice data to be recognized.
The embodiment of the application can be applied to various scenes such as conferences, classes and the like. Taking a conference scene as an example, a sound collection device may be disposed in a conference room to collect voice data of conference participants. In one case, the scheme can be executed while the conference is in progress, and voice data is collected and recognized in real time. In another case, only voice data may be recorded while the conference is in progress, and after the conference is ended, the recorded voice data is identified by executing the scheme.
For another example, in a classroom scenario, a sound collection device may be provided in a classroom to collect the voice data of teachers and students. In one case, the scheme can be executed during class, with voice data collected and recognized in real time. In another case, voice data may only be recorded during class, and after class ends the recorded voice data is recognized by executing the scheme.
S102: and carrying out sound source positioning on the voice data to be recognized to obtain a positioning result.
As an embodiment, the sound collecting device disposed in the scene may be a microphone array, and thus, S101 includes: and acquiring voice data acquired by the microphone array as voice data to be recognized. In this case, the positioning result of the voice data to be recognized may be obtained by comparing the voice data collected by each microphone in the microphone array.
The number of microphones in the microphone array is not limited specifically, and may be, for example, 4, 6, or 8, etc. The array shape of the microphone array is not limited, for example, the microphone array may be a linear array, a circular array, a distributed array, or the like.
There are various ways to perform sound source localization. For example, a DOA (direction of arrival) estimation algorithm may be adopted; or the sound source may be localized according to the differences between the times at which the sound it emits reaches the different microphones (time difference of arrival, TDOA).
Alternatively, sound source localization may be performed in other manners, such as localization based on high-resolution spectra or on steerable beams; these are not listed one by one.
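As an illustration of the TDOA idea above, the following minimal sketch estimates a direction of arrival from the delay between one pair of microphones, measured with GCC-PHAT cross-correlation. It assumes a far-field source and a known microphone spacing; the function names and the choice of GCC-PHAT are illustrative, not prescribed by this application.

import numpy as np

SPEED_OF_SOUND = 343.0  # approximate speed of sound in air, m/s

def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Estimate the delay (seconds) of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    r = SIG * np.conj(REF)
    cc = np.fft.irfft(r / (np.abs(r) + 1e-12), n=n)  # phase-transform weighting
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

def doa_angle(mic_a: np.ndarray, mic_b: np.ndarray, fs: int, spacing_m: float) -> float:
    """Angle of the source relative to the axis of a two-microphone pair,
    in degrees (90 = broadside, 0/180 = endfire)."""
    tau = gcc_phat(mic_a, mic_b, fs)
    # Far-field model: tau = spacing * cos(theta) / c; clamp for numerical safety.
    cos_theta = np.clip(tau * SPEED_OF_SOUND / spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

For a full array, such pairwise angle estimates (or a grid search over candidate directions) would typically be combined to obtain the localization result.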
In one case, S102 may include: and determining the angle of the sound source of the voice data to be recognized relative to the sound acquisition equipment, and taking the angle as the sound source positioning result of the voice data to be recognized.
For example, assume the sound collection device is a microphone array shaped as a linear array, as shown in fig. 2. A line l1 connecting the sound source to the microphone 3 at the center of the linear array can be constructed, and the angle θ1 between l1 and the straight line l2 on which the linear array lies can be used to represent the localization result of the sound source. Alternatively, the line l3 connecting the sound source to the microphone 5 farthest from the sound source can be constructed, and the angle θ2 between l3 and l2 used to represent the localization result. Alternatively, the line l4 connecting the sound source to the microphone 1 closest to the sound source can be constructed, and the angle θ3 between l4 and l2 used to represent the localization result. The localization result can also be the angle between l2 and the line connecting the sound source to any other microphone; these are not listed one by one.
Alternatively, a perpendicular l' to the straight line l2 on which the linear array lies can be constructed, and the angle between l' and the line connecting the sound source to the array used to represent the localization result of the sound source. The angles included in the localization result are only used to represent the position of the sound source relative to the sound collection device, and are not listed one by one.
For another example, assume the sound collection device is a microphone array with a circular shape, as shown in fig. 3. A line l5 passing through the sound source and the center of the circle can be constructed, and the angle θ4 between l5 and a diameter l6 of the circle used to represent the localization result of the sound source.
As another example, the sound collection device may also be a microphone array in the shape of a distributed array. As shown in fig. 4b, the microphone array includes two groups of microphones arranged along two directions: microphones 1-4 arranged along direction 1 and microphones 5-8 arranged along direction 2. The localization result of the sound source can therefore be represented by two angles, one per direction, which makes the localization more accurate. Alternatively, the distributed array may include microphone groups arranged along more directions; these are not listed one by one.
In addition, in some cases the above sound source localization methods can determine not only the angle of the sound source relative to the sound collection device but also its distance, so that the sound source is localized more accurately.
S103: and converting the voice data to be recognized into character data.
In this embodiment, the execution sequence of S102 and S103 is not limited, and S102 may be executed first and then S103 may be executed, or S103 may be executed first and then S102 may be executed, or S102 and S103 may be executed simultaneously.
As described above, the sound collection device disposed in the scene may be a microphone array, in which case, in one embodiment, the voice data collected by any one or more microphones in the microphone array may be converted into text data; in another embodiment, the voice data acquired by the microphone array may be subjected to beam forming according to the positioning result to obtain enhanced voice data, and the enhanced voice data may be converted into text data.
Beamforming means performing weighted synthesis on the voice data received by the multiple microphones in the microphone array, which is equivalent to forming a beam in a specified direction, i.e., enhancing the voice data from that direction. For example, in the example of fig. 3, let l5 be the line connecting the sound source and the center of the circle; the voice data in the direction of l5 is then enhanced, and the voice data in other directions is suppressed. The beamformed voice data is referred to as enhanced voice data. Performing speech recognition on the enhanced voice data (i.e., converting it into text data) improves the recognition accuracy compared to performing speech recognition on voice data that has not been beamformed.
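The weighted synthesis described above can be realized in many ways; a common instance is delay-and-sum beamforming, sketched below in the frequency domain. The steering delays would come from the localization result and the array geometry (for a linear array, for example, delay_m = position_m * cos(theta) / c); the uniform averaging weights and all names here are illustrative assumptions, not the specific weighting of this application.

import numpy as np

def delay_and_sum(channels: np.ndarray, delays_s: np.ndarray, fs: int) -> np.ndarray:
    """channels: (n_mics, n_samples) multichannel voice data;
    delays_s: per-microphone steering delays (seconds) toward the source.
    Returns the enhanced single-channel signal."""
    n_mics, n_samples = channels.shape
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    out = np.zeros(freqs.size, dtype=complex)
    for m in range(n_mics):
        spec = np.fft.rfft(channels[m])
        # A fractional-sample delay is a linear phase shift in the frequency domain.
        out += spec * np.exp(-2j * np.pi * freqs * delays_s[m])
    return np.fft.irfft(out / n_mics, n=n_samples)

Signals arriving from the steered direction add coherently while signals from other directions partially cancel, which is the enhancement/suppression behavior described above.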
S104: and marking the character data based on the positioning result.
In the above-described embodiment, the positioning result is the angle of the sound source with respect to the sound collection device, and in this case, the character data may be directly marked with the angle as a label.
Or, as another embodiment, an angle of a sound source of the voice data to be recognized with respect to the sound collection device may be determined as a sound source angle; searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized; in this case, S104 includes: and marking the character data by taking the searched seat identification as a label.
In scenarios such as conferences and classrooms, the seats are generally fixed, so a mapping relationship between seat identifiers and sound source angles can be established in advance. For example, as shown in fig. 4a, each seat represents a sound source position, and the sound collection device is a circular microphone array. The sound source is connected to the center of the circle, and the angle between this connecting line and a diameter l7 of the circle represents the angle of each sound source. Suppose the sound source angle corresponding to seat 1 is α1; a mapping between seat 1 and angle α1 is established. Similarly, mappings are established between seat 2 and angle α2, between seat 3 and angle α3, and between seat 4 and angle α4.
Assume that voice data A to be recognized is acquired, and the angle of its sound source with respect to the microphone array is determined to be α2; then, according to the established mapping relationship, the localization result of the sound source can be determined to be seat 2. The text data converted from voice data A can thus be marked with seat 2 as its label.
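The lookup in the pre-established mapping can be sketched as a nearest-angle match within a tolerance, as below; the seat table, tolerance value, and function name are made-up examples for illustration only.

from typing import Optional

SEAT_ANGLES = {  # seat identifier -> pre-registered sound source angle (degrees)
    "seat 1": 30.0,
    "seat 2": 80.0,
    "seat 3": 130.0,
    "seat 4": 180.0,
}

def lookup_seat(source_angle_deg: float, tolerance_deg: float = 15.0) -> Optional[str]:
    """Return the seat whose registered angle is closest to the measured angle,
    or None if no seat lies within the tolerance."""
    seat, angle = min(SEAT_ANGLES.items(), key=lambda kv: abs(kv[1] - source_angle_deg))
    return seat if abs(angle - source_angle_deg) <= tolerance_deg else None

For the distributed array of fig. 4b, the same lookup would simply match a pair of angles instead of a single one.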
As another example, the microphone array in fig. 4b is in the shape of a distributed array, which includes two groups of microphones arranged along two directions: microphones 1-4 arranged along direction 1 and microphones 5-8 arranged along direction 2. In this case, the sound source localization result can be represented by two angles, one per direction.
For example, take seat 6 as the sound source. A line (line 1) can be drawn between the sound source and the center of microphones 1-4 along direction 1, and the angle between line 1 and direction 1 is angle 1. Likewise, a line (line 2) can be drawn between the sound source and the center of microphones 5-8 along direction 2, and the angle between line 2 and direction 2 is angle 2. A mapping between (angle 1, angle 2) and seat 6 is then established.
The other seats are handled similarly and are not listed one by one. In fig. 4b, two angles are used to represent the sound source localization result, which makes the localization more accurate.
By applying this embodiment, voice data from different sound sources are marked with different labels, and voice data from different sound sources are the voice data uttered by different people. In the first aspect, the scheme thus distinguishes the voice data of different people. In the second aspect, it distinguishes them through sound source localization, and because the data volume of a sound source localization result is small, the overall amount of computation is reduced. In the third aspect, there is no need to manually collect each person's voice data separately, which saves manpower and improves flexibility.
Or, in some scenarios, the corresponding relationship between the seat and the person is also fixed, and in this case, a mapping relationship between the seat identifier and the person identity may be established, so that the person identity may be used as a sound source positioning result, and the person identity is used as a tag to mark the text data.
As an embodiment, after S104, the method may further include: and correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
If the scheme is applied to a conference scenario, the content record is a conference record; if applied to a classroom scenario, it is a classroom record. In addition, the content record may further include the time corresponding to the voice data. For example, the content record may be as shown in table 1:
TABLE 1
[Table 1 is presented as an image in the original publication and is not reproduced here.]
Table 1 is merely an example and does not limit the present invention.
For example, assume the person corresponding to seat 1 is a key participant in a conference, and one later wants to view the content spoken by that person separately; in this case, the text content labeled seat 1 can be selected for viewing according to the label.
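The corresponding storage of voice data and marked text data can be sketched as a simple record store; the field names (time, label, text, audio path) and the CSV output are assumptions made for illustration, echoing the example of table 1.

import csv
from dataclasses import dataclass

@dataclass
class RecordEntry:
    timestamp: str   # e.g. "2018-11-02 09:30:15"
    label: str       # seat identifier, angle, or person identity used as the tag
    text: str        # recognized text data
    audio_path: str  # path to the stored segment of voice data to be recognized

def save_content_record(entries: list[RecordEntry], path: str) -> None:
    """Write the marked entries out as a CSV content record."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "label", "text", "audio"])
        for e in entries:
            writer.writerow([e.timestamp, e.label, e.text, e.audio_path])

Keeping the audio path alongside the text allows a viewer to jump from any line of the record back to the original voice data.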
Fig. 5 is a second flowchart of the speech processing method according to the embodiment of the present application, including:
s501: and acquiring voice data acquired by the microphone array as voice data to be recognized.
The number of microphones in the microphone array is not limited specifically, and may be, for example, 4, 6, or 8, etc. The array shape of the microphone array is not limited, for example, the microphone array may be a linear array, a circular array, a distributed array, or the like.
S502: the method comprises the steps of comparing voice data collected by each microphone in a microphone array, and determining the angle of a sound source of the voice data to be recognized relative to the microphone array as a sound source angle.
S503: and searching a seat identifier corresponding to the sound source angle in a mapping relation between the seat identifier and the angle which is established in advance, and taking the seat identifier as a sound source positioning result of the voice data to be recognized.
In scenarios such as conferences and classrooms, the seats are generally fixed, so a mapping relationship between seat identifiers and sound source angles can be established in advance. For example, as shown in fig. 4a, each seat represents a sound source position, and the sound collection device is a circular microphone array. The sound source is connected to the center of the circle, and the angle between this connecting line and a diameter l7 of the circle represents the angle of each sound source. Suppose the sound source angle corresponding to seat 1 is α1; a mapping between seat 1 and angle α1 is established. Similarly, mappings are established between seat 2 and angle α2, between seat 3 and angle α3, and between seat 4 and angle α4.
Assume that voice data A to be recognized is acquired, and the angle of its sound source with respect to the microphone array is determined to be α2; then, according to the established mapping relationship, the localization result of the sound source can be determined to be seat 2.
S504: and according to the positioning result, carrying out beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into character data.
Beamforming means performing weighted synthesis on the voice data received by the multiple microphones in the microphone array, which is equivalent to forming a beam in a specified direction, i.e., enhancing the voice data from that direction. For example, in the example of fig. 3, let l5 be the line connecting the sound source and the center of the circle; the voice data in the direction of l5 is then enhanced, and the voice data in other directions is suppressed. The beamformed voice data is referred to as enhanced voice data. Performing speech recognition on the enhanced voice data (i.e., converting it into text data) improves the recognition accuracy compared to performing speech recognition on voice data that has not been beamformed.
S505: and marking the character data by taking the positioning result as a label.
S506: and correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
If the scheme is applied to a conference scenario, the content record is a conference record; if applied to a classroom scenario, it is a classroom record.
By applying the embodiment shown in fig. 5, in the first aspect, the voice data can be associated with the person information through sound source localization alone, so the amount of computation involved is small. In the second aspect, the voice data is enhanced by beamforming before being converted into text data, which improves the conversion quality. In the third aspect, marking the text data with the seat identifier as a label intuitively shows which content corresponds to which person. In the fourth aspect, the voice data and the marked text data are stored correspondingly, so the resulting content record is more complete.
Corresponding to the foregoing method embodiment, an embodiment of the present application further provides a speech processing apparatus, as shown in fig. 6, including:
an obtaining module 601, configured to obtain voice data to be recognized;
a positioning module 602, configured to perform sound source positioning on the voice data to be recognized to obtain a positioning result;
a conversion module 603, configured to convert the voice data to be recognized into text data;
a marking module 604, configured to mark the text data based on the positioning result.
As an embodiment, the obtaining module 601 may be specifically configured to: acquiring voice data acquired by a microphone array as voice data to be recognized;
the positioning module 602 may be specifically configured to: and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
As an implementation manner, the conversion module 603 is specifically configured to:
converting the voice data collected by any one or more microphones in the microphone array into text data; or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
As an embodiment, the positioning module 602 is specifically configured to: determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking module 604 is specifically configured to: and marking the character data by taking the angle as a label.
As an embodiment, the positioning module 602 is specifically configured to: determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle; searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking module 604 is specifically configured to: and marking the character data by taking the searched seat identification as a label.
As an embodiment, the apparatus further comprises:
and a storage module (not shown in the figure) for correspondingly storing the voice data to be recognized and the marked text data to obtain a content record.
Applying the embodiment shown in fig. 6, sound source localization is performed on voice data, and the text data converted from the voice data is marked based on the localization result; that is, the content (text data) spoken by different people is distinguished according to the positions of those people.
Embodiments of the present application also provide an electronic device, as shown in fig. 7, including a processor 701 and a memory 702, wherein:
the memory 702 is configured to store a computer program;
the processor 701 is configured to implement any of the above-described speech processing methods when executing the program stored in the memory 702.
The memory mentioned in the above electronic device may include a random access memory (RAM) or a non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements any one of the above-mentioned speech processing methods.
An embodiment of the present application further provides a speech processing system, as shown in fig. 8, including: a sound collection device and a voice processing device; wherein:
the voice acquisition equipment is used for acquiring voice data and sending the voice data to the voice processing equipment;
the voice processing device is used for receiving the voice data as voice data to be recognized; carrying out sound source positioning on the voice data to be recognized to obtain a positioning result; converting the voice data to be recognized into character data; and marking the character data based on the positioning result.
The speech processing apparatus may perform any of the speech processing methods described above.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, the electronic device embodiment, the computer-readable storage medium embodiment and the system embodiment, since they are substantially similar to the method embodiment, the description is relatively simple, and the relevant points can be referred to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (13)

1. A method of speech processing, comprising:
acquiring voice data to be recognized;
carrying out sound source positioning on the voice data to be recognized to obtain a positioning result;
converting the voice data to be recognized into character data;
and marking the character data based on the positioning result.
2. The method of claim 1, wherein the obtaining voice data to be recognized comprises: acquiring voice data acquired by a microphone array as voice data to be recognized;
the performing sound source positioning on the voice data to be recognized to obtain a positioning result comprises:
and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
3. The method of claim 2, wherein converting the speech data to be recognized into text data comprises:
converting the voice data collected by any one or more microphones in the microphone array into text data;
or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
4. The method according to claim 1, wherein the performing sound source localization on the speech data to be recognized to obtain a localization result comprises:
determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking the text data based on the positioning result comprises:
and marking the character data by taking the angle as a label.
5. The method according to claim 1, wherein the performing sound source localization on the speech data to be recognized to obtain a localization result comprises:
determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle;
searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking the text data based on the positioning result comprises:
and marking the character data by taking the searched seat identification as a label.
6. The method of claim 1, further comprising, after said marking the text data based on the positioning result:
and correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
7. A speech processing apparatus, comprising:
the acquisition module is used for acquiring voice data to be recognized;
the positioning module is used for positioning a sound source of the voice data to be recognized to obtain a positioning result;
the conversion module is used for converting the voice data to be recognized into character data;
and the marking module is used for marking the character data based on the positioning result.
8. The apparatus of claim 7, wherein the obtaining module is specifically configured to: acquiring voice data acquired by a microphone array as voice data to be recognized;
the positioning module is specifically configured to: and comparing the voice data collected by each microphone in the microphone array to obtain a positioning result of the voice data to be recognized.
9. The apparatus of claim 8, wherein the conversion module is specifically configured to:
converting the voice data collected by any one or more microphones in the microphone array into text data; or, according to the positioning result, performing beam forming on the voice data acquired by the microphone array to obtain enhanced voice data, and converting the enhanced voice data into text data.
10. The apparatus according to claim 7, wherein the positioning module is specifically configured to: determining the angle of the sound source of the voice data to be recognized relative to sound collection equipment as a sound source positioning result of the voice data to be recognized;
the marking module is specifically configured to: and marking the character data by taking the angle as a label.
11. The apparatus according to claim 7, wherein the positioning module is specifically configured to: determining the angle of a sound source of the voice data to be recognized relative to sound collection equipment as a sound source angle; searching a seat identifier corresponding to the sound source angle in a mapping relation between a seat identifier and an angle which is established in advance, and using the seat identifier as a sound source positioning result of the voice data to be recognized;
the marking module is specifically configured to: and marking the character data by taking the searched seat identification as a label.
12. The apparatus of claim 7, further comprising:
and the storage module is used for correspondingly storing the voice data to be recognized and the marked character data to obtain a content record.
13. A speech processing system, comprising: a sound collection device and a voice processing device; wherein:
the voice acquisition equipment is used for acquiring voice data and sending the voice data to the voice processing equipment;
the voice processing device is used for receiving the voice data as voice data to be recognized; carrying out sound source positioning on the voice data to be recognized to obtain a positioning result; converting the voice data to be recognized into character data; and marking the character data based on the positioning result.
CN201811302321.1A 2018-11-02 2018-11-02 Voice processing method, device and system Pending CN111145753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811302321.1A CN111145753A (en) 2018-11-02 2018-11-02 Voice processing method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811302321.1A CN111145753A (en) 2018-11-02 2018-11-02 Voice processing method, device and system

Publications (1)

Publication Number Publication Date
CN111145753A true CN111145753A (en) 2020-05-12

Family

ID=70515468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811302321.1A Pending CN111145753A (en) 2018-11-02 2018-11-02 Voice processing method, device and system

Country Status (1)

Country Link
CN (1) CN111145753A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104898091A (en) * 2015-05-29 2015-09-09 复旦大学 Microphone array self-calibration sound source positioning system based on iterative optimization algorithm
CN107124647A (en) * 2017-05-27 2017-09-01 深圳市酷开网络科技有限公司 A kind of panoramic video automatically generates the method and device of subtitle file when recording
CN107799118A (en) * 2016-09-05 2018-03-13 深圳光启合众科技有限公司 Voice directions recognition methods and apparatus and system, home controller
CN108629024A (en) * 2018-05-09 2018-10-09 王泽普 A kind of teaching Work attendance method based on voice recognition

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104898091A (en) * 2015-05-29 2015-09-09 复旦大学 Microphone array self-calibration sound source positioning system based on iterative optimization algorithm
CN107799118A (en) * 2016-09-05 2018-03-13 深圳光启合众科技有限公司 Voice directions recognition methods and apparatus and system, home controller
CN107124647A (en) * 2017-05-27 2017-09-01 深圳市酷开网络科技有限公司 A kind of panoramic video automatically generates the method and device of subtitle file when recording
CN108629024A (en) * 2018-05-09 2018-10-09 王泽普 A kind of teaching Work attendance method based on voice recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈小平 (Chen Xiaoping): "无线传感器网络" [Wireless Sensor Networks], 30 April 2017 *

Similar Documents

Publication Publication Date Title
Vera-Diaz et al. Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates
CN106657865B (en) Conference summary generation method and device and video conference system
Brandstein A framework for speech source localization using sensor arrays
CN102630385B (en) Method, device and system for audio zooming process within an audio scene
CN102763432B (en) Processing of multi-device audio capture
CN101567969B (en) Intelligent video director method based on microphone array sound guidance
WO2018095166A1 (en) Device control method, apparatus and system
CN109308892B (en) Voice synthesis broadcasting method, device, equipment and computer readable medium
CN104246878A (en) Audio user interaction recognition and context refinement
CN103902963A (en) Method and electronic equipment for recognizing orientation and identification
CN109191442B (en) Ultrasonic image evaluation and screening method and device
CN110443371A (en) A kind of artificial intelligence device and method
Gabriel et al. 2D sound source position estimation using microphone arrays and its application to a VR-based bird song analysis system
CN111223107A (en) Point cloud data set manufacturing system and method based on point cloud deep learning
CN113314138B (en) Sound source monitoring and separating method and device based on microphone array and storage medium
CN111145753A (en) Voice processing method, device and system
CN106056503A (en) Intelligent music teaching platform and application method thereof
CN110175260B (en) Method and device for distinguishing recording roles and computer-readable storage medium
CN113643708B (en) Method and device for identifying ginseng voiceprint, electronic equipment and storage medium
KR20190016683A (en) Apparatus for automatic conference notetaking using mems microphone array
CN113611308B (en) Voice recognition method, device, system, server and storage medium
JP2019103011A (en) Converter, conversion method, and program
WO2023010599A1 (en) Target trajectory calibration method based on video and audio, and computer device
CN111492668B (en) Method and system for locating the origin of an audio signal within a defined space
CN113115009A (en) Image transmission method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200512