CN111402892A - Conference recording template generation method based on voice recognition


Info

Publication number: CN111402892A
Application number: CN202010210036.8A
Authority: CN (China)
Prior art keywords: conference, target, template, voice, conference recording
Priority / filing date: 2020-03-23
Publication date: 2020-07-10
Legal status: Withdrawn
Other languages: Chinese (zh)
Inventor: 钱敏
Current assignee (also original assignee): Zhengzhou Zhilixin Information Technology Co ltd

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 - Segmentation; Word boundary detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to a conference recording template generation method based on voice recognition. The method acquires an audio signal of a speaker and records the acquisition time of the audio signal; processes the audio signal and its acquisition time to obtain a conference recording blank template, a target conference item, conference keywords, and the target name and target face image of the speaker; fills the target conference item, the conference keywords, the target name and the target face image into the corresponding areas of the conference recording blank template to generate a target conference recording template; and finally displays the target conference recording template on a corresponding screen. Conference recording personnel therefore need neither prepare the conference recording template by hand nor fill this information into the blank template themselves, which reduces their workload, avoids recording errors caused by manual generation and filling, and improves the accuracy of the generated conference record.

Description

Conference recording template generation method based on voice recognition
Technical Field
The invention relates to a conference recording template generation method based on voice recognition.
Background
When a conference is held, especially an important one, dedicated conference recording personnel are needed to record its progress. A common practice is to prepare a conference recording template on a computer in advance and have the recording personnel fill the relevant content into it. Existing conference recording templates are blank tables: during the conference, preparatory content such as the conference time and speaker information must be filled into the corresponding areas. This content belongs to the conference recording template rather than to the conference record text itself, so filling it in imposes extra work on the recording personnel, increases their workload, and can lead to recording errors during rapid note-taking.
Disclosure of Invention
The invention aims to provide a conference recording template generation method based on voice recognition, to solve the problem that manually filling in the contents of the conference recording template imposes extra work on conference recording personnel and increases their workload.
To solve the above problems, the invention adopts the following technical scheme:
a conference recording template generation method based on voice recognition comprises the following steps:
acquiring an audio signal of a speaker, and recording the acquisition time of the audio signal;
generating a conference recording blank template according to the audio signal, wherein the conference recording blank template comprises a conference item filling area, a conference keyword filling area, a speaker name filling area, a speaker face image filling area and a conference recording text filling area;
determining a target conference item corresponding to the acquisition time of the audio signal according to the acquisition time of the audio signal and a preset conference flow;
recognizing the audio signal to obtain corresponding text data;
extracting the conference keywords contained in the text data;
acquiring a target voiceprint of the speaker according to the audio signal;
inputting the target voiceprint into a preset conference personnel database, and acquiring the target name and target face image of the speaker corresponding to the target voiceprint, wherein the conference personnel database comprises at least two groups of data, each group comprising a voiceprint and the name and face image of the person corresponding to that voiceprint;
filling the target conference item into a conference item filling area of the conference recording blank template, filling the conference keyword into a conference keyword filling area of the conference recording blank template, filling the target name of the speaker into a speaker name filling area of the conference recording blank template, filling the target face image of the speaker into a speaker face image filling area of the conference recording blank template, and generating a target conference recording template;
and displaying the target conference recording template on a corresponding screen.
Optionally, after the target conference recording template is displayed on a corresponding screen, the conference recording template generation method further includes:
and outputting the target conference recording template to a printer for printing.
Optionally, extracting the conference keywords contained in the text data includes:
inputting the text data into a preset conference keyword database to obtain the conference keywords in the text data.
Optionally, recognizing the audio signal to obtain corresponding text data includes:
generating a speech waveform of the audio signal in a preset speech coordinate system;
dividing the speech waveform based on a voice activity detection algorithm to obtain at least two valid speech segments;
extracting a speech characteristic curve corresponding to each valid speech segment through a speech feature recognition algorithm;
extracting a standard characteristic curve associated with each candidate character from a preset corpus;
drawing the standard characteristic curve and the speech characteristic curve on a preset characteristic coordinate system, and calculating the difference area of the intersection region between the two curves;
if the difference area for any candidate character is smaller than a preset area difference threshold, identifying that candidate character as text information contained in the corresponding valid speech segment;
and combining the pieces of text information in order, based on the sequence of the valid speech segments in the speech waveform, to generate the text data.
The invention has the following beneficial effects. A conference recording blank template is generated from the acquired audio signal of the speaker; this blank template is the initial conference recording template and comprises a conference item filling area, a conference keyword filling area, a speaker name filling area, a speaker face image filling area and a conference recording text filling area. The following processing is then carried out: the target conference item corresponding to the acquisition time of the audio signal is determined from that acquisition time and a preset conference flow; the audio signal is recognized to obtain corresponding text data, and the conference keywords contained in the text data are extracted; a target voiceprint of the speaker is obtained from the audio signal and looked up in a preset conference personnel database to obtain the target name and target face image of the speaker.

The obtained target conference item, conference keywords, target name and target face image are then filled into the corresponding filling areas of the blank template to generate the target conference recording template, which is finally displayed on a corresponding screen. The method thus generates the relevant data from the speaker's audio signal and fills it into the generated blank template automatically, so conference recording personnel need neither prepare the template nor fill in this information by hand, which reduces their workload, avoids errors caused by manual generation and filling, and improves the accuracy of the generated conference record.
Drawings
To illustrate the technical solution of the embodiment of the present invention more clearly, the drawings needed in the embodiment are briefly described below:
Fig. 1 is a flowchart of the conference recording template generation method based on voice recognition.
Detailed Description
This embodiment provides a conference recording template generation method based on voice recognition. The execution subject of the method may be an intelligent mobile terminal (such as a smartphone or a tablet computer), a computer (such as a notebook computer, a desktop computer or a computer host), a server device, and the like; the execution subject is not specifically limited in this application. A typical application scenario of the conference recording template generation method is a conference room.
As shown in fig. 1, the method for generating a conference recording template includes the following steps:
acquiring an audio signal of a speaker, and recording the acquisition time of the audio signal:
the audio signal of the speaker is captured by a microphone, which may be fixed to the speaking table in the conference room.
When the audio signal of the speaker is acquired, the acquisition time of the audio signal, that is, when the audio signal is acquired, is recorded.
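As an illustrative sketch only, the acquisition step might look as follows in Python, assuming the third-party sounddevice library and a fixed recording window; the names and parameters are hypothetical, not part of the patent's disclosure:

    from datetime import datetime

    import sounddevice as sd  # assumed audio-capture dependency

    SAMPLE_RATE = 16000  # Hz; a common sampling rate for speech


    def acquire_audio(duration_s=5.0):
        """Record a mono audio signal from the microphone and note the
        acquisition time, i.e. when the audio signal is acquired."""
        acquisition_time = datetime.now()
        signal = sd.rec(int(duration_s * SAMPLE_RATE),
                        samplerate=SAMPLE_RATE, channels=1)
        sd.wait()  # block until the recording window ends
        return signal.squeeze(), acquisition_time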
Four processing procedures are then performed on the acquired audio signal of the speaker or on its acquisition time, and subsequent processing combines their results. It should be understood that there is no strict order among the four procedures: the order may be set according to actual needs, or they may be performed simultaneously.
First processing procedure:
generating a conference recording blank template according to the audio signal, wherein the conference recording blank template comprises a conference item filling area, a conference keyword filling area, a speaker name filling area, a speaker face image filling area and a conference recording text filling area:
after the audio signal is acquired, a conference recording blank template is generated according to the audio signal, wherein the conference recording blank template is an initial conference recording template and is used for obtaining a target conference recording template according to the conference recording template and relevant data information obtained subsequently. The conference recording blank template comprises a conference item filling area, a conference keyword filling area, a speaker name filling area, a speaker face image filling area and a conference recording text filling area. The conference item filling area is used for filling conference items, the conference keyword filling area is used for filling conference keywords, the speaker name filling area is used for filling names of speakers, the speaker face image filling area is used for filling face images of the speakers, and the conference record text filling area is used for filling conference records. Table 1 shows a specific template structure of a conference recording blank template, where an area a is a conference item filling area, an area B is a conference keyword filling area, an area C is a speaker name filling area, an area D is a speaker face image filling area, and an area E is a conference recording text filling area.
TABLE 1
[Table 1: blank conference recording template; area A = conference item, area B = conference keywords, area C = speaker name, area D = speaker face image, area E = conference recording text]
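Purely as an illustration (this structure is not part of the patent), the blank template of Table 1 could be modeled as a simple data object whose fields mirror areas A to E; all names below are hypothetical:

    from dataclasses import dataclass, field


    @dataclass
    class BlankTemplate:
        """Blank conference recording template mirroring areas A-E of Table 1."""
        conference_item: str = ""                                 # area A
        conference_keywords: list = field(default_factory=list)  # area B
        speaker_name: str = ""                                    # area C
        speaker_face_image: str = ""                              # area D (image reference)
        record_text: str = ""                                     # area E, left for the recorder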
Second processing procedure:
determining a target conference item corresponding to the acquisition time of the audio signal according to the acquisition time of the audio signal and a preset conference flow:
A conference flow is preset; it comprises at least two conference time periods and the conference procedure (i.e., the conference item) corresponding to each period, for example: the conference item for 9:00-10:00 is the general manager's speech, the item for 10:00-11:00 is the department managers' speeches, and the item for 11:00-12:00 is the staff representatives' speeches.
The target conference item corresponding to the acquisition time of the audio signal can then be determined from that acquisition time and the preset conference flow. For example, if the acquisition time of the audio signal is 9:35, the target conference item is determined, by consulting the preset conference flow, to be the general manager's speech. A minimal sketch of this lookup follows.
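The preset flow can be modeled as a list of (start, end, item) periods; this sketch uses the example schedule above:

    from datetime import time

    # Preset conference flow taken from the example above
    CONFERENCE_FLOW = [
        (time(9, 0), time(10, 0), "general manager's speech"),
        (time(10, 0), time(11, 0), "department managers' speeches"),
        (time(11, 0), time(12, 0), "staff representatives' speeches"),
    ]


    def target_conference_item(acquisition_time):
        """Return the conference item whose period covers the acquisition time."""
        t = acquisition_time.time()
        for start, end, item in CONFERENCE_FLOW:
            if start <= t < end:
                return item
        return None  # acquired outside every scheduled period

An audio signal acquired at 9:35 thus maps to the general manager's speech.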
Third processing procedure:
Recognizing the audio signal to obtain corresponding text data:
and carrying out voice recognition on the audio signal to obtain corresponding text data. The voice recognition is performed on the audio signal to obtain the text data, which belongs to the conventional technical means, and a specific implementation process is provided in this embodiment. The specific implementation process steps given in this embodiment include:
(1) Generate a speech waveform of the audio signal in a preset speech coordinate system. The ordinate of the speech coordinate system may be the audio amplitude and the abscissa the acquisition time, yielding a time-domain speech waveform. In addition, before the waveform is generated, the audio signal may be filtered to remove environmental noise, and the noise-filtered signal may be lightly smoothed so that invalid noise bands are filtered out.
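A sketch of this step, assuming NumPy/SciPy and a speech-band filter as the noise-removal stage (the 300-3400 Hz cutoffs are an assumption, not from the patent):

    import numpy as np
    from scipy.signal import butter, filtfilt


    def speech_waveform(signal, fs=16000):
        """Band-pass the audio to suppress out-of-band noise, then return the
        time-domain waveform: abscissa = acquisition time, ordinate = amplitude."""
        b, a = butter(4, [300.0, 3400.0], btype="bandpass", fs=fs)
        filtered = filtfilt(b, a, signal)  # zero-phase filtering
        t = np.arange(len(filtered)) / fs
        return t, filtered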
(2) Divide the speech waveform into at least two valid speech segments based on a voice activity detection algorithm. A valid speech segment is one that contains speech content; correspondingly, an invalid segment contains none. A speech start amplitude and a speech end amplitude may be set, with the start amplitude greater than the end amplitude, i.e. the requirement for opening a valid speech segment is stricter than that for closing one. At the moment a speaker begins to speak, the volume and pitch are usually high, so the corresponding amplitude is large; during speech, some characters are pronounced weakly or softly and should not be taken as an interruption of speaking, so the end amplitude must be lowered appropriately to avoid misrecognition. The speech waveform is therefore segmented according to the start and end amplitudes, dividing it into at least two valid speech segments in which the amplitude at the start time is greater than or equal to the speech start amplitude and the amplitude at the end time is less than or equal to the speech end amplitude. Other implementations of this division are of course possible; a simplified sketch of the two-threshold rule follows.
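This illustrates only the two-threshold rule above; the threshold values are hypothetical, and a practical detector would also smooth the envelope and enforce minimum segment durations:

    import numpy as np


    def split_valid_segments(amplitude, start_amp=0.10, end_amp=0.03):
        """Divide the waveform into valid speech segments: a segment opens when
        |amplitude| rises to the speech start amplitude and closes only when it
        falls to the speech end amplitude (start threshold > end threshold)."""
        env = np.abs(amplitude)
        segments, seg_start, in_speech = [], 0, False
        for i, a in enumerate(env):
            if not in_speech and a >= start_amp:
                in_speech, seg_start = True, i
            elif in_speech and a <= end_amp:
                in_speech = False
                segments.append((seg_start, i))
        if in_speech:  # speech still open at the end of the recording
            segments.append((seg_start, len(env)))
        return segments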
(3) Extract the speech characteristic curve corresponding to each valid speech segment through a speech feature recognition algorithm. In this embodiment the algorithm may be a Fourier transform: each valid speech segment is converted from a time-domain curve into a frequency-domain waveform, which serves as the segment's speech characteristic curve. If the converted frequency-domain waveform is discrete, it can be fitted linearly and the fitted curve output as the speech characteristic curve.
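As a sketch, the Fourier-based feature extraction could be as simple as a magnitude spectrum (a production system would typically use richer features):

    import numpy as np


    def feature_curve(segment, fs=16000):
        """Convert a valid speech segment from the time domain into a
        frequency-domain magnitude curve via the Fourier transform."""
        spectrum = np.abs(np.fft.rfft(segment))
        freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
        return freqs, spectrum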
(4) Extract the standard characteristic curve associated with each candidate character from a preset corpus. The corpus contains all recognizable candidate characters, each associated with a standard characteristic curve. A standard characteristic curve can be obtained by converting a speech signal of the character's standard pronunciation in at least one language. If several different languages are to be recognized, the speech signals of the standard pronunciations in each language are processed by the speech feature algorithm to obtain several different standard characteristic curves, all of which are associated with the candidate character.
(5) Draw the standard characteristic curve and the speech characteristic curve on a preset characteristic coordinate system and calculate the difference area of the intersection region between them. Drawing both curves on the same coordinate system makes the difference between them easy to compare; the degree of difference is determined mainly by the size of the area enclosed between the two curves (the difference area of the intersection region). The larger this area, the greater the difference between the curves and the higher the probability that the valid speech segment does not contain the candidate character; conversely, the smaller the area, the smaller the difference and the higher the probability that the segment does contain it. Furthermore, to improve recognition accuracy, the speech characteristic curve is normalized: the waveform of the valid speech segment is divided into several character segments according to its peak changes, each character segment containing at least one peak so that each corresponds to one character. Each character segment is then normalized in the time domain according to its length, i.e. its duration is scaled to a preset standard duration and its amplitude is adjusted in equal proportion to a preset maximum amplitude, after which the normalized character segment is converted to obtain its speech characteristic curve.
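One plausible reading of the "difference area" is the area enclosed between the two normalized curves on a shared axis; a sketch under that assumption:

    import numpy as np


    def difference_area(freqs_a, curve_a, freqs_b, curve_b, n_points=512):
        """Resample both characteristic curves onto a shared axis, normalize the
        amplitudes, and integrate the absolute difference between them;
        a smaller area means more similar curves."""
        lo = max(freqs_a.min(), freqs_b.min())
        hi = min(freqs_a.max(), freqs_b.max())
        axis = np.linspace(lo, hi, n_points)
        a = np.interp(axis, freqs_a, curve_a)
        b = np.interp(axis, freqs_b, curve_b)
        a = a / max(a.max(), 1e-12)  # amplitude normalization
        b = b / max(b.max(), 1e-12)
        return float(np.trapz(np.abs(a - b), axis))

A candidate character whose difference area falls below the preset area difference threshold would then be accepted as recognized text for the segment, as step (6) describes.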
(6) If the difference area for any candidate character is smaller than the preset area difference threshold, identify that candidate character as text information contained in the corresponding valid speech segment. That is, if the difference area between a candidate character's standard characteristic curve and the speech characteristic curve is below the threshold, the candidate character is taken to occur in the speech content of that segment; the order of the recognized candidate characters is determined by their positions of occurrence in the segment, and they are combined in that order to obtain the segment's text information. Comparing the standard characteristic curve of every candidate character against the speech characteristic curve in this way identifies the text contained in the valid speech segment and improves the accuracy of the generated text.
(7) Combine the pieces of text information in order, based on the sequence of the valid speech segments in the speech waveform, to generate the text data. Specifically, the punctuation mark joining two pieces of text information can be chosen according to the degree of association between the last character of the preceding valid speech segment and the first character of the following one, together with the interval between the two segments; the text data is then generated from the pieces of text information and the connecting punctuation, which improves its readability. By dividing the audio signal into several speech segments, this embodiment reduces the amount of data handled in each recognition pass, balances recognition accuracy against computation, and thereby improves the accuracy of the generated conference recording template. A toy version of the punctuation rule is sketched below.
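The sketch uses only the pause between neighbouring segments (the association-degree term is omitted for brevity); the pause threshold is hypothetical:

    def join_segments(texts, gaps_s, short_pause=0.5):
        """Combine per-segment text in temporal order, inserting a comma after
        a short pause and a period after a longer one."""
        parts = []
        for i, text in enumerate(texts):
            parts.append(text)
            if i < len(gaps_s):
                parts.append("," if gaps_s[i] < short_pause else ".")
        return "".join(parts)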
Extracting the conference keywords contained in the text data:
after the text data is acquired, the conference keywords related to the conference are extracted from the text data, and this embodiment provides an implementation manner: a conference key database is preset, the conference key database comprises at least one conference key, and the conference key in the conference key database is specifically set according to actual conditions, such as conference subjects. And inputting the text data into a preset conference keyword database, comparing the text data with each conference keyword in the conference keyword database one by one, and if the conference keyword in the conference keyword database exists in the text data, extracting the conference keyword to obtain the conference keyword in the text data. Further, in order to improve the recognition efficiency, the text data may be split into a plurality of words or single characters, each word or single character is respectively input into the conference keyword database, and the conference keywords in the text data are obtained through comparison.
As another embodiment, the text data may be analyzed by a semantic analysis algorithm and the conference keywords extracted from it on that basis. Alternatively, a conference topic can be preset and the conference keywords extracted from the text data according to that topic.
It should be understood that if the text data contains no conference keyword, none is output.
Fourth processing procedure:
acquiring a target voiceprint of the speaker according to the audio signal:
and identifying the voiceprint of the obtained audio signal through a voiceprint identification algorithm, wherein the voiceprint is the target voiceprint of the speaker. Voiceprint (Voiceprint) is a spectrum of sound waves carrying verbal information. The voiceprint is unique like a fingerprint and has the function of identity recognition (identification of an individual). Each person has a specific voiceprint, which varies from person to person. Regardless of how one intentionally simulates the voice and tone of another, even if the simulation is vivid, the voiceprint is still different.
To facilitate voiceprint recognition, the audio signal may be a common sentence, such as "hello, everyone". Voiceprint recognition algorithms are conventional, and obtaining a voiceprint from an audio signal likewise belongs to the conventional art, so it is not described in detail here.
Inputting the target voiceprint into a preset conference personnel database, and acquiring the target name and target face image of the speaker corresponding to the target voiceprint, wherein the conference personnel database comprises at least two groups of data, each group comprising a voiceprint and the name and face image of the person corresponding to that voiceprint:
the conference personnel database is preset, the conference personnel database comprises at least two groups of data, each group of data comprises a voiceprint, and the name and the face image of a personnel corresponding to the voiceprint, the conference personnel database can be stored in a data table mode, and a specific implementation mode of the conference personnel database is given in a table 2.
TABLE 2

  Voiceprint | Person name | Person face image
  -----------+-------------+------------------
  X1         | Y1          | Z1
  X2         | Y2          | Z2
  ...        | ...         | ...
Here voiceprint X1 corresponds to person name Y1 and person face image Z1; voiceprint X2 corresponds to person name Y2 and person face image Z2; and so on.
It should be understood that when the conference personnel database is established, the voiceprints, person names and face images are entered in advance: each face image is either captured beforehand by a camera or uploaded directly from an existing photo, after which the correspondence among voiceprint, name and face image is established.
By inputting the target voiceprint into the conference personnel database, the target name and target face image of the speaker corresponding to that voiceprint can be obtained. For example, if the target voiceprint is X2, the speaker's target name is Y2 and the target face image is Z2. A minimal lookup sketch follows.
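This is a sketch only: the patent does not prescribe how voiceprints are stored or matched, so it assumes feature vectors compared by cosine similarity with a hypothetical acceptance threshold:

    import numpy as np


    def cosine_similarity(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


    def lookup_speaker(target_voiceprint, person_db, threshold=0.8):
        """person_db rows mirror Table 2: (voiceprint vector X, name Y, face image Z).
        Return the name and face image of the closest-matching person."""
        voiceprint, name, face_image = max(
            person_db, key=lambda row: cosine_similarity(target_voiceprint, row[0]))
        if cosine_similarity(target_voiceprint, voiceprint) < threshold:
            return None, None  # no sufficiently close match
        return name, face_image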
The four processing procedures respectively yield the target conference item, the conference keywords, and the target name and target face image of the speaker. Next:
filling the target conference item into a conference item filling area of the conference recording blank template, filling the conference keyword into a conference keyword filling area of the conference recording blank template, filling the target name of the speaker into a speaker name filling area of the conference recording blank template, filling the target face image of the speaker into a speaker face image filling area of the conference recording blank template, and generating a target conference recording template:
The obtained target conference item is filled into the conference item filling area of the conference recording blank template (area A in Table 1), the conference keywords into the conference keyword filling area (area B in Table 1), the target name of the speaker into the speaker name filling area (area C in Table 1), and the target face image of the speaker into the speaker face image filling area (area D in Table 1). The resulting file, with areas A, B, C and D all filled with the relevant data, is the target conference recording template.
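Reusing the hypothetical BlankTemplate sketched after Table 1, the filling step reduces to assigning areas A to D and leaving area E empty for the recorder:

    def fill_template(template, item, keywords, name, face_image):
        """Fill areas A-D of the blank template; area E stays empty for the
        conference recorder. The filled object is the target template."""
        template.conference_item = item           # area A
        template.conference_keywords = keywords   # area B
        template.speaker_name = name              # area C
        template.speaker_face_image = face_image  # area D
        return template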
Displaying the target conference recording template on a corresponding screen:
To facilitate conference recording against the target conference recording template, the template is displayed on a corresponding screen, for example a computer screen; the conference recording personnel can then take the record by operating the computer, i.e. by filling the specific content of the conference record into the conference recording text filling area.
In addition, if the specific content of the conference record is to be written into the conference recording text filling area by hand, the target conference recording template needs to be printed. That is, after the target conference recording template is displayed on the corresponding screen, the conference recording template generation method further comprises:
Outputting the target conference recording template to a printer for printing:
and outputting the generated target meeting record template to a printer for printing to obtain a paper target meeting record template.
The above embodiments merely illustrate the technical solutions of the present invention in specific cases; any equivalent substitution or modification, or partial substitution, that does not depart from the spirit and scope of the present invention shall be covered by the claims of the present invention.

Claims (4)

1. A conference recording template generation method based on voice recognition is characterized by comprising the following steps:
acquiring an audio signal of a speaker, and recording the acquisition time of the audio signal;
generating a conference recording blank template according to the audio signal, wherein the conference recording blank template comprises a conference item filling area, a conference keyword filling area, a speaker name filling area, a speaker face image filling area and a conference recording text filling area;
determining a target conference item corresponding to the acquisition time of the audio signal according to the acquisition time of the audio signal and a preset conference flow;
recognizing the audio signal to obtain corresponding text data;
extracting the conference keywords contained in the text data;
acquiring a target voiceprint of the speaker according to the audio signal;
inputting the target voiceprint into a preset conference personnel database, and acquiring the target name and target face image of the speaker corresponding to the target voiceprint, wherein the conference personnel database comprises at least two groups of data, each group comprising a voiceprint and the name and face image of the person corresponding to that voiceprint;
filling the target conference item into a conference item filling area of the conference recording blank template, filling the conference keyword into a conference keyword filling area of the conference recording blank template, filling the target name of the speaker into a speaker name filling area of the conference recording blank template, filling the target face image of the speaker into a speaker face image filling area of the conference recording blank template, and generating a target conference recording template;
and displaying the target conference recording template on a corresponding screen.
2. The method for generating a conference recording template based on voice recognition according to claim 1, wherein after the target conference recording template is displayed on a corresponding screen, the method further comprises:
and outputting the target conference recording template to a printer for printing.
3. The method for generating a conference recording template based on voice recognition according to claim 1, wherein extracting the conference keywords contained in the text data comprises:
and inputting the text data into a preset conference keyword database to obtain the conference keywords in the text data.
4. The method for generating a conference recording template based on voice recognition according to claim 1, wherein recognizing the audio signal to obtain corresponding text data comprises:
generating a speech waveform of the audio signal in a preset speech coordinate system;
dividing the speech waveform based on a voice activity detection algorithm to obtain at least two valid speech segments;
extracting a speech characteristic curve corresponding to each valid speech segment through a speech feature recognition algorithm;
extracting a standard characteristic curve associated with each candidate character from a preset corpus;
drawing the standard characteristic curve and the speech characteristic curve on a preset characteristic coordinate system, and calculating the difference area of the intersection region between the two curves;
if the difference area for any candidate character is smaller than a preset area difference threshold, identifying that candidate character as text information contained in the corresponding valid speech segment;
and combining the pieces of text information in order, based on the sequence of the valid speech segments in the speech waveform, to generate the text data.
Application CN202010210036.8A, filed 2020-03-23 (priority date 2020-03-23): Conference recording template generation method based on voice recognition. Withdrawn.

Priority Applications (1)

CN202010210036.8A (priority and filing date 2020-03-23): Conference recording template generation method based on voice recognition

Publications (1)

CN111402892A, published 2020-07-10

Family ID: 71431146

Cited By (6)

CN111931484A (priority 2020-07-31, published 2020-11-13), 于梦丽: Data transmission method based on big data
CN111931484B (priority 2020-07-31, published 2022-02-25), 贵州多彩宝互联网服务有限公司: Data transmission method based on big data
CN111818294A (priority 2020-08-03, published 2020-10-23), 上海依图信息技术有限公司: Method, medium and electronic device for multi-person conference real-time display combined with audio and video
WO2022062471A1 (priority 2020-09-25, published 2022-03-31), 华为技术有限公司: Audio data processing method, device and system
CN116993297A (priority 2023-08-16, published 2023-11-03), 华腾建信科技有限公司: Task data generation method and system based on electronic conference record
CN116993297B (priority 2023-08-16, published 2024-02-27), 华腾建信科技有限公司: Task data generation method and system based on electronic conference record


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (application publication date: 2020-07-10)