CN112468665A - Method, device, equipment and storage medium for generating conference summary

Info

Publication number: CN112468665A
Application number: CN202011224273.6A
Authority: CN
Inventors: 曹乐; 李琪; 宋育芳; 李孔仁
Original assignee: China Construction Bank Corp
Current assignee: China Construction Bank Corp
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2021-03-09

Abstract

The invention discloses a method, a device, equipment and a storage medium for generating a conference summary. The method comprises the following steps: determining a target conference theme of a target conference in response to a conference event; determining voice information corresponding to each participant in the target conference; determining, in a preset text library, a target preset text matched with the target conference theme, wherein the preset text library comprises preset texts associated with conference themes, and the conference themes comprise the target conference theme; recognizing the voice information according to the target preset text to obtain a voice recognition text; and generating a conference summary according to the voice recognition text and a preset conference summary format. The invention can accurately generate the conference summary in real time from the voice information in a multi-person conference, avoids the complexity and time consumption of manually arranging the conference voice text in the prior art, improves the generation efficiency of the conference summary, and greatly improves the accuracy and the real-time performance of conference voice recognition.

Description

Method, device, equipment and storage medium for generating conference summary

Technical Field

The invention relates to the technical field of voice recognition, in particular to a method, a device, equipment and a storage medium for generating a conference summary.

Background

In the prior art, after a conference is recorded with a recording pen, voice recognition is performed by voice recognition software: based on the voice characteristics of different speakers in the voice stream, the voice information corresponding to each speaker is determined from the voice stream, so that a plurality of pieces of voice information are obtained, and voice transcription is then performed on this basis to generate the conference summary. In a multi-person conference, this method often cannot accurately extract the voice information of different speakers, the voice recognition system cannot accurately recognize the real semantics of the speakers, and real-time transcription cannot be performed.

Disclosure of Invention

The invention aims to provide a method, a device, equipment and a storage medium for generating a conference summary so as to improve the accuracy and the real-time performance of conference voice recognition and save labor cost.

The invention is realized by the following technical scheme:

In a first aspect, the present invention provides a method for generating a conference summary, including:

determining a target conference theme of a target conference in response to a conference event;

determining voice information corresponding to each participant in the target conference;

determining a target preset text matched with the target conference theme in a preset text library; the preset text library comprises preset texts associated with conference themes, and the conference themes comprise the target conference theme;

recognizing the voice information according to the target preset text to obtain a voice recognition text;

and generating a conference summary according to the voice recognition text and a preset conference summary format.

In a second aspect, the present invention provides a device for generating a conference summary, including:

the first acquisition module is used for responding to the conference event and determining a target conference theme of the target conference;

the second acquisition module is used for determining the voice information corresponding to each participant in the target conference;

the target preset text determining module is used for determining a target preset text matched with the target conference theme in a preset text library; the preset text library comprises preset texts associated with conference themes, and the conference themes comprise the target conference theme;

the voice recognition module is used for recognizing the voice information according to the target preset text to obtain a voice recognition text;

and the conference summary generation module is used for generating a conference summary according to the voice recognition text and a preset conference summary format.

In a third aspect, the present invention provides an apparatus comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the method for generating a meeting summary described above.

In a fourth aspect, the present invention provides a computer-readable storage medium, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the method for generating a meeting summary described above.

The implementation of the technical scheme of the invention has the following beneficial effects:

By means of the preset text library and the preset words associated with the conference theme, the invention solves the problem that existing voice recognition systems recognize specialized words inaccurately. The invention can accurately generate a conference summary in real time from the voice information in a multi-person conference, avoids the complexity and time consumption of manually arranging conference voice texts in the prior art, improves the generation efficiency of the conference summary, helps save labor cost, and greatly improves the accuracy and the real-time performance of conference voice recognition.

Drawings

In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present invention or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method for generating a conference summary according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the microphone recognition range according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of recognizing voice information according to a target preset text to obtain a voice recognition text according to an embodiment of the present invention;

Fig. 4 is a schematic flowchart of extracting key words from a conference summary according to an embodiment of the present invention;

Fig. 5 is a schematic flowchart of modifying a preset text library according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of creating a preset text library according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of a device for generating a conference summary according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the following examples. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," "third," and the like in the description and in the claims, and in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Examples

The generation of the conference summary is generally achieved by three steps: firstly, acquiring voice information of participants through a voice acquisition device; then, sending the acquired voice information to a voice recognition system through a network for voice recognition; and finally, receiving the text content recognized by the system voice at the system interface, and generating a conference summary according to a preset conference summary format.

In the prior art, conference recording mainly consists of acquiring user voice with a recording pen and then transcribing the semantic content with a voice recognition tool. In a multi-person conference, because a plurality of participants take part in the discussion, this approach often cannot accurately separate the voices of different users, and the recording must be manually arranged after the conference ends so that the voice information corresponds accurately to each speaker; the approach therefore lacks real-time performance and is inefficient. Moreover, a conference generally has a specific theme, and the specialized words used for that theme are highly specific, so the semantic recognition function may fail to recognize the real semantics of the user. Where the conference summary contains text with inaccurate voice recognition results, the user must manually modify the conference summary after voice recognition, which increases labor cost.

Therefore, the present specification provides a technical solution that can simultaneously achieve the accuracy and the real-time performance of the conference voice recognition; specifically, the method comprises the following steps:

An embodiment of the present invention provides a method for generating a conference summary, a flowchart of which is shown in fig. 1. The present specification provides the method operation steps described in the embodiment or the flowchart, but more or fewer operation steps may be included based on conventional or non-creative labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution. In practice, a system or server product may execute the steps sequentially or in parallel (e.g., in a parallel-processor or multi-threaded environment) according to the embodiments or the methods shown in the figures. Specifically, as shown in fig. 1, the method may include:

S101: in response to a conference event, determining a target conference theme of the target conference. A conference generally has a specific theme, and determining that theme provides the basis for establishing the preset text library associated with it.

S102: determining the voice information corresponding to each participant in the target conference.

In the prior art, a voice of a user is obtained by using a recording pen, and then semantic content is transcribed by using a voice recognition tool. Because the method can not accurately extract the voices of different users, the method also needs manual finishing after the conference is finished, and has no real-time performance and low efficiency. When a plurality of persons participate in the meeting, a plurality of recording pens are needed to record voice, and the time sequence among the text information recognized by a plurality of voice streams needs to be confirmed manually, so that more labor cost is occupied. With the popularization of computers and the rapid development of internet technology, people have higher and higher requirements on intelligent experience. For example, when a multi-person conference discussion is performed, a user wants to recognize the speaking content of the participants in the conference in real time, and arrange the conference summary content according to the time line and the roles.

In a specific embodiment, determining the voice information corresponding to each participant in the target conference specifically includes: receiving target voice information of a preset angle of a target microphone; determining a target participant corresponding to the target microphone based on the corresponding relation between the microphones and the participants; and taking the target voice information as the voice information of the target participant to obtain the voice information corresponding to each participant.
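The attribution step above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `VoiceSegment` type, the `attribute_segments` function, the microphone identifiers and the participant names are all hypothetical, and the audio payload is treated as opaque bytes.

```python
from dataclasses import dataclass


@dataclass
class VoiceSegment:
    microphone_id: str  # which microphone captured the audio (hypothetical id)
    start_time: float   # seconds since the conference started
    audio: bytes        # raw audio payload, treated as opaque here


def attribute_segments(segments, mic_to_participant):
    """Group captured segments by the participant bound to each microphone."""
    by_participant = {}
    for segment in segments:
        participant = mic_to_participant.get(segment.microphone_id)
        if participant is None:
            # Unknown microphone: skip rather than guess the speaker.
            continue
        by_participant.setdefault(participant, []).append(segment)
    return by_participant


# One-to-one correspondence between microphones and participants (assumed data).
mic_to_participant = {"mic-1": "Participant A", "mic-2": "Participant B"}
segments = [
    VoiceSegment("mic-1", 0.0, b"..."),
    VoiceSegment("mic-2", 4.2, b"..."),
    VoiceSegment("mic-1", 9.5, b"..."),
]
voice_by_participant = attribute_segments(segments, mic_to_participant)
```

Because the speaker is read from the microphone binding rather than inferred from voice characteristics, no speaker separation is needed in this sketch.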

As shown in fig. 2, the number of microphones equals the number of participants, and the microphones correspond to the participants one-to-one. For example, for N participants, N microphones are provided (a first microphone, a second microphone, a third microphone, and so on), and each microphone corresponds to one participant. Each microphone is configured to receive voice information within a preset angle, where the preset angle refers to the maximum angular range of sound sources that the microphone can receive, with the position of the microphone as the origin and the line connecting the microphone and its corresponding participant as the center line. In fig. 2, microphones 1, 2 and 3 each correspond to one participant; the solid line indicates that the participant is aligned with the corresponding microphone, the dotted lines indicate the maximum range within which the microphone can receive voice, and the microphone cannot receive voice from outside the dotted lines. The embodiment of the invention takes into account the complexity of the environment when multiple persons communicate and, for the special scenario of a multi-person conference, focuses on accurately and clearly acquiring the voice of each participant, eliminating common interference factors such as multiple sound sources speaking simultaneously and environmental noise.

Specifically, the position of the microphone is aligned with the position of each participant, and the voice receiving angle of each microphone may be 45 °, that is, the preset angle is 45 °, so as to ensure that only the voice information of the participant corresponding to the microphone is received, and the sound from other sound sources is shielded. After the microphone collects the voice information of each participant, the voice information is transmitted to a back-end system through a wireless network in real time for voice analysis and processing, and a semantic recognition stage is started.
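The geometry of the preset angle can be illustrated with a short sketch. It assumes a two-dimensional room layout and assumes that the 45° figure is the full width of the receiving range, so a source is accepted when it lies within 22.5° on either side of the center line. The function name `within_preset_angle` and the coordinates are hypothetical; in practice the directivity would be a property of the microphone hardware rather than a software check.

```python
import math


def within_preset_angle(mic_pos, participant_pos, source_pos, preset_angle_deg=45.0):
    """Return True if the source lies inside the microphone's receiving range.

    The center line runs from the microphone to its participant; the preset
    angle is treated as the full width of the range (an assumption).
    """
    cx, cy = participant_pos[0] - mic_pos[0], participant_pos[1] - mic_pos[1]
    sx, sy = source_pos[0] - mic_pos[0], source_pos[1] - mic_pos[1]
    center_norm = math.hypot(cx, cy)
    source_norm = math.hypot(sx, sy)
    if center_norm == 0 or source_norm == 0:
        return False  # direction undefined when two positions coincide
    cos_angle = (cx * sx + cy * sy) / (center_norm * source_norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    offset_deg = math.degrees(math.acos(cos_angle))
    return offset_deg <= preset_angle_deg / 2.0


mic = (0.0, 0.0)
participant = (0.0, 1.0)
on_axis = within_preset_angle(mic, participant, (0.0, 2.0))    # directly on the center line
neighbour = within_preset_angle(mic, participant, (1.0, 1.0))  # 45 degrees off the center line
```

Under these assumptions, a neighbouring speaker sitting 45° off the center line falls outside the range, which mirrors the dotted-line boundary in fig. 2.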

In practical applications, the voice receiving angle of the microphone may be adjusted arbitrarily according to the actual performance of the microphone and the arrangement of the conference hall, which is not limited in the embodiment of the present invention.

It is understood that other speech capture methods may be used by those skilled in the art to achieve accurate and clear capture of the speech of each participant.

S103: determining a target preset text matched with the target conference theme in a preset text library; the preset text library comprises preset texts associated with conference themes, and the conference themes comprise the target conference theme.

The emphasis of the semantic recognition function is enabling the system to accurately recognize the real semantics of the speaker. A conference generally has a specific theme, and specialized words correspond to that theme. In conferences with different themes, the same sound uttered by a speaker may correspond to different transcribed words; that is, the homophone phenomenon occurs. Homophones are a group of words that are pronounced identically but are unrelated in meaning, for example the Chinese word pairs rendered as "lodging-complaint, simple-quarantine, fighter-war, self-describing-word number" and the like. According to statistics on the modern Chinese vocabulary, homophones account for about one tenth of all words, which brings great difficulty to voice recognition. In addition, some specialized words or abbreviations are used only in specific environments; for example, in a bank working conference, the "credit loan process management system" is often referred to as "old CP" for short. Because the specialized words used in conferences with different themes are highly specific, the voice recognition system often cannot accurately recognize the real semantics of the user, and the user then needs to manually modify the recognized conference summary, which increases labor cost.

The embodiment of the present invention solves the above problem of inaccurate speech recognition by presetting a text library, and as a specific implementation manner, the method further includes: acquiring a preset conference theme and preset words corresponding to the preset conference theme; establishing a corresponding relation between a preset conference theme and preset words; and generating a preset text corresponding to the preset conference theme in a preset text library based on the corresponding relation.
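The correspondence between preset conference themes and preset words can be sketched as a simple in-memory mapping. This is only an illustration under assumed names: the `PresetTextLibrary` class, its methods and the sample theme are hypothetical, and a real system would presumably persist the library rather than keep it in memory.

```python
from collections import defaultdict


class PresetTextLibrary:
    """Maps each conference theme to the preset words registered for it."""

    def __init__(self):
        self._words_by_theme = defaultdict(set)

    def add_preset_text(self, theme, words):
        # Establish (or extend) the correspondence between a theme and its words.
        self._words_by_theme[theme].update(words)

    def target_preset_text(self, theme):
        # Return a copy so callers cannot mutate the library by accident.
        return set(self._words_by_theme.get(theme, set()))


library = PresetTextLibrary()
library.add_preset_text("credit system upgrade", ["old CP", "credit loan process management system"])
target_preset_text = library.target_preset_text("credit system upgrade")
```

Looking up the target conference theme in this mapping corresponds to determining the target preset text in step S103.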

According to the conference theme, preset words related to the conference theme can be obtained, the preset words can comprise homophones, special words used in specific themes and contexts, short words and the like, and the preset words can be input as preset texts in a manual input mode. Specifically, obtaining a preset conference theme and preset words corresponding to the preset conference theme includes:

responding to a creation request of a preset text, and displaying a text creation interface, wherein the text creation interface comprises a theme input area and a word input area; and responding to the creation confirmation operation, and acquiring the input contents in the theme input area and the word input area to obtain a preset conference theme and preset words corresponding to the preset conference theme.

When the preset words are input into the preset text base in the manual mode, a user can judge whether the words are easily confused words such as homophones and the like according to actual conditions, so that the words can be input according to specific needs; for special words and short words used in specific environment and context, the user can also enter the words according to actual conditions so as to avoid the increase of labor cost caused by unnecessary entering.

In another specific embodiment, the obtaining of the preset conference theme and the preset words corresponding to the preset conference theme may further include:

responding to a creation request of a preset text, and acquiring an imported target document; determining a conference theme of a target document to obtain a preset conference theme; and extracting key words in the target document, wherein the key words are used as preset words corresponding to preset conference subjects.

Specifically, the key terms in the target document are extracted, where the key terms may be high-frequency terms, and the high-frequency terms may be extracted specifically according to whether the number of times that the terms appear in the target document is greater than a preset threshold.
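A minimal sketch of threshold-based extraction is given below. It assumes English-like text that can be tokenized with a regular expression; Chinese documents would need a word segmenter, which is outside the scope of this illustration. The function name `extract_high_frequency_terms` and the sample text are hypothetical.

```python
import re
from collections import Counter


def extract_high_frequency_terms(document_text, threshold):
    """Return the terms that occur more than `threshold` times in the text."""
    # Simplistic tokenization (an assumption): lowercase alphanumeric runs.
    tokens = re.findall(r"[A-Za-z0-9]+", document_text.lower())
    counts = Counter(tokens)
    return {term for term, count in counts.items() if count > threshold}


sample_document = "loan review loan approval loan review"
candidate_words = extract_high_frequency_terms(sample_document, threshold=1)
```

The returned set is only a list of candidates; whether each candidate is actually imported remains a decision for the user.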

It should be noted that, in the step of importing the high-frequency words into the preset text library, in order to accurately identify the real semantics, the user may further determine whether to import the high-frequency words into the preset text library according to actual conditions and specific needs. For example, although some words appear more times and belong to high-frequency words, the words are not related to conference subjects and belong to common words, and errors generally do not occur in speech recognition of the words, so that a user can choose not to import the high-frequency words into a preset text library.

S104: recognizing the voice information according to the target preset text to obtain a voice recognition text.

When the system recognizes voice, it preferentially checks the preset texts in the preset text library. In a specific embodiment, as shown in fig. 3, recognizing the voice information according to the target preset text to obtain the voice recognition text specifically includes:

S301: performing semantic recognition according to the voice information to obtain a preliminary recognition text;

S302: searching the preset text library for a target preset text matched with the preliminary recognition text;

S303: if a target preset text matched with the preliminary recognition text exists in the preset text library, using the target preset text as the voice recognition text; if no such target preset text exists in the preset text library, obtaining the voice recognition text by the system's own automatic judgment.
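The priority lookup in S301 to S303 can be sketched as follows. The specification does not state how a preliminary recognition text is matched against a preset text, so this sketch assumes, purely for illustration, that the recognizer yields a pronunciation key for each recognized unit and that preset words are indexed by the same key. The function name, the keys and the sample words are hypothetical.

```python
def recognize_with_preset(preliminary, preset_by_pronunciation):
    """Prefer preset words over the recognizer's own output.

    `preliminary` is a list of (pronunciation_key, recognized_text) pairs
    produced by a preliminary recognition pass (assumed format).
    """
    result = []
    for pronunciation, recognized_text in preliminary:
        preset_word = preset_by_pronunciation.get(pronunciation)
        if preset_word is not None:
            # A matching preset text exists: use it as the recognition text.
            result.append(preset_word)
        else:
            # No match: fall back to the recognizer's automatic judgment.
            result.append(recognized_text)
    return " ".join(result)


preset_by_pronunciation = {"lao cp": "old CP"}
preliminary = [("qi yong", "enable"), ("lao cp", "old seepy")]
recognized = recognize_with_preset(preliminary, preset_by_pronunciation)
```

Joining with spaces suits the English sample only; the point of the sketch is the lookup order, in which the theme-specific preset text takes precedence over the generic result.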

In the embodiment, after the voice information is acquired, the preset text in the preset text library is preferentially searched during voice recognition, and because the preset text library is associated with the conference theme, the semantic recognition is more accurate, so that the accuracy of the voice recognition can be improved.

S105: generating a conference summary according to the voice recognition text and a preset conference summary format.

The conference summary generally has a preset format. After the voice information of each participant and the corresponding voice recognition texts are obtained in the above manner, the conference summary is obtained by displaying the voice recognition texts of the participants in chronological order according to the preset conference summary format.

According to the embodiment of the invention, the preset text library and the preset words associated with the conference theme solve the problem that existing voice recognition systems recognize specialized words inaccurately. The conference summary can be generated accurately and in real time from the voice information in a multi-person conference, which avoids the complexity and time consumption of manually arranging the conference voice text in the prior art, improves the generation efficiency of the conference summary, saves labor cost, and greatly improves the accuracy and the real-time performance of conference voice recognition.
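Arranging the recognized texts by time and speaker can be sketched as below. The line format, the `Utterance` type and the function names are hypothetical; the specification only requires that a preset conference summary format exists, not what it looks like.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    start_time: float  # seconds since the conference started
    participant: str
    text: str          # voice recognition text for this utterance


def format_timestamp(seconds):
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def generate_summary(theme, utterances, line_format="[{time}] {participant}: {text}"):
    """Render utterances in chronological order using an assumed line format."""
    lines = [f"Theme: {theme}"]
    for utterance in sorted(utterances, key=lambda u: u.start_time):
        lines.append(line_format.format(
            time=format_timestamp(utterance.start_time),
            participant=utterance.participant,
            text=utterance.text,
        ))
    return "\n".join(lines)


utterances = [
    Utterance(65.0, "Participant B", "I agree."),
    Utterance(3.0, "Participant A", "Let us review the old CP rollout."),
]
summary = generate_summary("Credit system upgrade", utterances)
```

Sorting by start time means utterances from different microphones interleave correctly without any manual ordering.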

In a specific embodiment, as shown in fig. 4, after the conference summary is generated, the method further comprises:

S401: extracting key words from the conference summary;

S402: adding the key words to the preset text library as preset texts.

Specifically, a key word in the conference summary is extracted, the key word may be a high-frequency word, for example, and the high-frequency word may be extracted specifically according to whether the number of times of occurrence of the word in the conference summary is greater than a preset threshold.

It should be noted that, in the step of adding the high-frequency words as the preset text to the preset text library, in order to accurately identify the real semantics, the user may further determine whether to import the high-frequency words into the preset text library according to actual conditions and specific needs. For example, although some words appear more times and belong to high-frequency words, the words are not related to conference subjects and belong to common words, and errors generally do not occur in speech recognition of the words, so that a user can choose not to import the high-frequency words into a preset text library as preset texts.
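The feedback from the conference summary into the preset text library, including the user's decision on each candidate, can be sketched as follows. The `approve` callback stands in for the user's choice and is an assumption, as are the function name and the sample words; the summary is assumed to be already split into words.

```python
from collections import Counter


def add_summary_keywords(summary_words, library_words, threshold, approve):
    """Add approved high-frequency words from the summary to the library set.

    Returns the set of words that were newly added.
    """
    counts = Counter(summary_words)
    added = set()
    for word, count in counts.items():
        # Only frequent, not-yet-registered words are offered to the user.
        if count > threshold and word not in library_words and approve(word):
            library_words.add(word)
            added.add(word)
    return added


summary_words = ["old CP", "old CP", "today", "today", "today", "rate"]
library_words = set()
# The user declines the common word "today" (assumed decision).
newly_added = add_summary_keywords(
    summary_words, library_words, threshold=1, approve=lambda word: word != "today"
)
```

Here "today" is frequent but unrelated to the theme, so it is rejected, matching the example of common words given above.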

In a specific embodiment, as shown in fig. 5, after the conference summary is generated, the method further comprises:

S501: in response to a modification instruction for the conference summary, determining a target word to be modified in the conference summary;

S502: determining, in the target preset text, the preset word to be replaced that corresponds to the target word to be modified;

S503: replacing the preset word to be replaced with the modified text corresponding to the target word to be modified.
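Steps S501 to S503 can be sketched as a single correction function. The function name and sample data are hypothetical. The behavior when the target word is not yet present in the preset text (adding the corrected text anyway) is an assumption of this sketch, made in line with the statement elsewhere in the specification that modified text is imported into the preset text library.

```python
def apply_correction(summary, preset_words, target_word, corrected_text):
    """Correct the summary and mirror the correction in the preset words.

    Returns the updated summary and a new set of preset words.
    """
    updated_summary = summary.replace(target_word, corrected_text)
    updated_words = set(preset_words)
    # Replace the preset word that corresponds to the target word, if any.
    updated_words.discard(target_word)
    # Assumption: the corrected text is registered even if nothing was replaced.
    updated_words.add(corrected_text)
    return updated_summary, updated_words


summary = "Participant A: migrate to old CB"
preset_words = {"old CB", "credit line"}
new_summary, new_preset_words = apply_correction(summary, preset_words, "old CB", "old CP")
```

Returning a new set instead of mutating the input keeps the original preset words available until the correction is confirmed.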

In the embodiment of the invention, in order to further improve the accuracy of the conference voice recognition, the conference summary can be modified after the conference summary is generated. Further, based on the modified conference summary, the preset text base can be modified to obtain a more accurate preset text base.

As shown in fig. 6, in the embodiment of the present invention, the preset text library is generated mainly in three ways: pre-importing, automatic adding, and manual modification. Pre-importing may include manually entering preset words related to a conference theme, or importing key words from a target document as preset texts. Automatic adding refers to extracting key words from the conference summary and adding them to the preset text library. Manual modification refers to modifying the conference summary and importing the modified text into the preset text library, so as to obtain more accurate preset texts and further improve the accuracy of conference voice recognition.

According to the technical scheme provided by the embodiments of this specification, the preset text library and the preset words associated with the conference theme solve the problem that existing voice recognition systems recognize specialized words inaccurately, and greatly improve the accuracy of voice transcription. In addition, multiple microphones are used, and the angle at which each microphone receives voice is set so that only the voice information of the participant corresponding to that microphone is received. As a result, the conference summary can be generated accurately and in real time from the voice information in a multi-person conference, the complexity and time consumption of manually arranging the conference voice text in the prior art are avoided, the generation efficiency of the conference summary is improved, labor cost is saved, the accuracy and the real-time performance of conference voice recognition are greatly improved, and the user experience is improved.

An embodiment of the present invention further provides a device for generating a conference summary. As shown in fig. 7, the device in this embodiment includes: a first acquisition module, used for determining a target conference theme of the target conference in response to a conference event; a second acquisition module, used for determining the voice information corresponding to each participant in the target conference; a target preset text determining module, used for determining a target preset text matched with the target conference theme in a preset text library, wherein the preset text library comprises preset texts associated with conference themes, and the conference themes comprise the target conference theme; a voice recognition module, used for recognizing the voice information according to the target preset text to obtain a voice recognition text; and a conference summary generation module, used for generating a conference summary according to the voice recognition text and a preset conference summary format.

It should be noted that the device embodiment and the method embodiment are based on the same inventive concept. For details, please refer to the method embodiment; they are not repeated here.

An embodiment of the present invention provides an electronic device, where the electronic device includes a processor and a memory, where the memory stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the method for generating a conference summary provided in the above method embodiment.

The embodiment of the present invention further provides a storage medium, where the storage medium may be disposed in an electronic device to store at least one instruction or at least one program for implementing the method for generating a conference summary in the method embodiment, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the method for generating a conference summary provided in the method embodiment.

Alternatively, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

It should be noted that the order of the above embodiments of the present invention is only for description and does not represent the relative merits of the embodiments. Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device and electronic apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Finally, it should be noted that the above embodiments are only specific embodiments of the present invention. They are intended to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed herein, modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they shall be construed as falling within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for generating a conference summary, comprising:

2. The method of generating a conference summary according to claim 1, further comprising:

acquiring a preset conference theme and preset words corresponding to the preset conference theme;

establishing a corresponding relation between the preset conference theme and the preset words;

and generating a preset text corresponding to the preset conference theme in a preset text library based on the corresponding relation.
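
As an illustration only, the three steps of claim 2 might be sketched as follows. The claims do not specify how the preset text library is stored, so the `PresetTextLibrary` class, its method names, and the example theme and words are all hypothetical.

```python
from collections import defaultdict


class PresetTextLibrary:
    """Hypothetical in-memory library mapping a conference theme to preset words."""

    def __init__(self):
        # theme -> set of preset words (the "corresponding relation")
        self._words_by_theme = defaultdict(set)

    def add(self, theme, words):
        """Record the correspondence between a theme and its preset words."""
        self._words_by_theme[theme].update(words)

    def preset_text(self, theme):
        """Return the preset words for a theme, sorted for stable output."""
        return sorted(self._words_by_theme.get(theme, set()))


library = PresetTextLibrary()
library.add("credit risk review", ["non-performing loan", "collateral"])
library.add("credit risk review", ["provision coverage"])
print(library.preset_text("credit risk review"))
```

Keeping the words in a set per theme means that re-submitting the same word for a theme does not create duplicates.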

3. The method for generating a conference summary according to claim 2, wherein acquiring the preset conference theme and the preset words corresponding to the preset conference theme comprises:

responding to a creation request of a preset text, and displaying a text creation interface, wherein the text creation interface comprises a theme input area and a word input area;

and responding to a creation confirmation operation, acquiring the input contents in the theme input area and the word input area, and obtaining the preset conference theme and preset words corresponding to the preset conference theme.

4. The method for generating a conference summary according to claim 2, wherein acquiring the preset conference theme and the preset words corresponding to the preset conference theme comprises:

responding to a creation request of a preset text, and acquiring an imported target document;

determining a conference theme of the target document to obtain the preset conference theme;

and extracting key terms in the target document, wherein the key terms are used as preset terms corresponding to the preset conference theme.
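
The key-term extraction step of claim 4 is not tied to any particular algorithm. A minimal sketch, assuming plain English text and a simple word-frequency heuristic, could look like this; the function name `extract_key_terms`, the stopword list, and the sample document are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on"}


def extract_key_terms(document, top_n=3):
    """Return the top_n most frequent non-stopword tokens in the document."""
    words = re.findall(r"[a-z]+", document.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


doc = "The collateral review covers collateral valuation. Collateral and valuation rules."
print(extract_key_terms(doc, top_n=2))
```

Frequency counting is only a stand-in here; any keyword extraction technique that yields terms for the imported document would fit the same step.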

5. The method for generating a conference summary according to claim 1, wherein the recognizing the voice information according to the target preset text to obtain a voice recognition text specifically comprises:

performing semantic recognition on the voice information to obtain a preliminary recognition text;

determining whether a target preset text matching the preliminary recognition text exists in the preset text library;

if a target preset text matching the preliminary recognition text exists in the preset text library, using the target preset text as the voice recognition text; and if no target preset text matching the preliminary recognition text exists in the preset text library, obtaining the voice recognition text in an automatic judgment mode.
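
The lookup-with-fallback logic of claim 5 can be sketched roughly as below. The claim does not define how a match is determined or what the automatic judgment mode does, so this sketch assumes fuzzy string matching via the standard-library `difflib.get_close_matches` and simply keeps the preliminary text as the fallback; the function name `recognize` and the `cutoff` value are hypothetical.

```python
import difflib


def recognize(preliminary_text, preset_texts, cutoff=0.8):
    """Return the matching preset text, or fall back to the preliminary text."""
    matches = difflib.get_close_matches(
        preliminary_text, preset_texts, n=1, cutoff=cutoff
    )
    if matches:
        # A sufficiently similar preset text exists: use it as the result.
        return matches[0]
    # Stand-in for the "automatic judgment mode", which the claim leaves open.
    return preliminary_text


presets = ["collateral", "provision coverage"]
print(recognize("colateral", presets))
print(recognize("quarterly budget", presets))
```

A higher `cutoff` makes the substitution more conservative, which reduces the chance of replacing a correctly recognized word with an unrelated preset term.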

6. The method for generating a conference summary according to claim 1, further comprising, after generating the conference summary according to the voice recognition text and the preset conference summary format:

extracting key words in the conference summary;

and adding the key words as preset texts to the preset text library.

7. The method for generating a conference summary according to claim 1, further comprising, after generating the conference summary according to the voice recognition text and the preset conference summary format:

responding to a modification instruction of a conference summary, and determining a target word to be modified in the conference summary;

determining a preset word to be replaced corresponding to the target word to be modified in the target preset text;

and replacing the preset words to be replaced according to the modified texts corresponding to the target words to be modified.
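
One possible reading of the correction step in claim 7 is sketched below, assuming the target preset text is held as a list of preset words. The function name `apply_correction` and the sample words are hypothetical.

```python
def apply_correction(preset_words, target_word, corrected_word):
    """Return a new list with target_word replaced by corrected_word.

    If target_word does not appear, the returned list equals the input.
    """
    return [corrected_word if w == target_word else w for w in preset_words]


preset_words = ["colateral", "provision coverage"]
updated = apply_correction(preset_words, "colateral", "collateral")
print(updated)
```

Writing the correction back into the preset text means that later recognition passes use the corrected word rather than repeating the earlier error.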

8. The method for generating a conference summary according to claim 1, wherein determining the voice information corresponding to each participant in the target conference specifically comprises:

receiving target voice information from within a preset angle of a target microphone;

determining a target participant corresponding to the target microphone based on the corresponding relation between the microphones and the participants;

and taking the target voice information as the voice information of the target participant to obtain the voice information corresponding to each participant.
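
The microphone-to-participant attribution in claim 8 might be sketched as follows. The sketch assumes each audio chunk arrives tagged with a microphone identifier and does not model the preset pickup angle; the function name `group_by_participant`, the identifiers, and the placeholder chunks are hypothetical.

```python
from collections import defaultdict


def group_by_participant(segments, mic_to_participant):
    """Group audio chunks by participant.

    segments: iterable of (mic_id, audio_chunk) pairs.
    mic_to_participant: mapping from mic_id to participant name.
    """
    voice_by_participant = defaultdict(list)
    for mic_id, chunk in segments:
        participant = mic_to_participant.get(mic_id)
        if participant is None:
            continue  # unknown microphone: skipped in this sketch
        voice_by_participant[participant].append(chunk)
    return dict(voice_by_participant)


mic_map = {"mic-1": "Participant A", "mic-2": "Participant B"}
segments = [("mic-1", b"chunk0"), ("mic-2", b"chunk1"), ("mic-1", b"chunk2"), ("mic-9", b"noise")]
print(group_by_participant(segments, mic_map))
```

Because each chunk is attributed through the microphone mapping, speaker identity is resolved without any voiceprint analysis.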

9. An apparatus for generating a conference summary, comprising:

10. An apparatus comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, the at least one instruction, the at least one program, set of codes, or set of instructions being loaded and executed by the processor to implement a method of generating a conference summary according to any one of claims 1 to 8.

11. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement a method of generating a conference summary according to any one of claims 1 to 8.