CN112597912A

CN112597912A - Conference content recording method, device, equipment and storage medium

Info

Publication number: CN112597912A
Application number: CN202011569099.9A
Authority: CN
Inventors: 姚元庆; 郑婕; 高崟鑫; 刘若愚; 宋少华; 赵亚莉
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2020-12-26
Filing date: 2020-12-26
Publication date: 2021-04-02

Abstract

The application discloses a conference content recording method, a conference content recording device, conference content recording equipment and a conference content recording storage medium. The method comprises the following steps: receiving a conference video; determining an action area of a target action in the conference video; determining the identification of the target action according to the face image information of the face area closest to the action area; and generating the conference content according to the identification of the target action and the target expression information of the target action. The method realizes automatic recording of the conference content of the deaf-mute, reduces the workload of conference recording personnel, and improves the recording efficiency of the conference content.

Description

Conference content recording method, device, equipment and storage medium

Technical Field

The present application relates to the field of image recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recording conference content.

Background

At present, conference recording usually uses a microphone or other devices to record the speech of each participant during the conference. After the conference is finished, the participants and the speech contents of the participants in the conference process are sorted and recorded by playing the sound records.

However, the above scheme does not work when a participant is a deaf-mute who cannot conveniently speak and who communicates with other participants through gesture actions. Because such a participant conveys information through actions rather than speech, the microphone cannot collect any sound signal.

Therefore, a method for recording the conference content of the deaf-mute is needed in the art.

Disclosure of Invention

In order to solve the technical problem, the application provides a method, a device, equipment and a storage medium for recording conference content. The method can automatically generate the information of the deaf-mute and the content expressed by the deaf-mute by identifying the target action and the face information of the deaf-mute, and can effectively solve the problem of recording the conference content of the deaf-mute.

The embodiment of the application discloses the following technical scheme:

in a first aspect, the present application provides a method for recording meeting content, including:

receiving a conference video;

determining an action area of a target action in the conference video;

determining the identification of the target action according to the face image information of the face area closest to the action area;

and generating conference content according to the identification of the target action and the target expression information of the target action.

Optionally, the determining, according to the face image information of the face region closest to the action region, the identifier of the target action includes:

determining the central point of the action area and the central points of the N face areas; wherein N is more than or equal to 1 and is an integer;

if N is 1, acquiring first face image information of one face area; determining the identification of the target action according to the first face image information;

if N is larger than 1, the distance between the center point of the action area and the center point of each face area is obtained, the center point of the face area closest to the center point of the action area is determined, and second face image information corresponding to the center point of the face area closest to the center point of the action area is obtained; and determining the identifier of the target action according to the second face image information.

Optionally, the generating the conference content according to the identifier of the target action and the expression information of the target action includes:

acquiring a mapping relation between actions and expression information;

determining target expression information corresponding to the target action according to the mapping relation;

and binding the target expression information with the identification of the target action to generate the conference content.

Optionally, the identifier of the target action includes a name of the target object or a job number of the target object.

Optionally, the target action comprises an arm action and/or a lip action.

Optionally, the method further includes:

and presenting the conference content.

Optionally, the method further includes:

receiving configuration information of a user;

and exporting the conference content according to the file format indicated by the configuration information.

In a second aspect, the present application provides an apparatus for recording conference content, including: the device comprises a receiving module, a processing module and a generating module;

the receiving module is used for receiving the conference video;

the processing module is used for determining an action area of a target action in the conference video; determining the identification of the target action according to the face image information of the face area closest to the action area;

and the generating module is used for generating the conference content according to the identification of the target action and the target expression information of the target action.

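Optionally, the processing module is specifically configured to determine a central point of the action region and central points of N face regions, wherein N is more than or equal to 1 and is an integer; if N is 1, acquire first face image information of the one face region and determine the identification of the target action according to the first face image information; if N is larger than 1, obtain the distance between the central point of the action region and the central point of each face region, determine the central point of the face region closest to the central point of the action region, obtain second face image information corresponding to that central point, and determine the identification of the target action according to the second face image information.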

Optionally, the generating module is specifically configured to obtain a mapping relationship between the action and the expression information; determining target expression information corresponding to the target action according to the mapping relation; and binding the target expression information with the identification of the target action to generate the conference content.

Optionally, the target action comprises an arm action and/or a lip action.

Optionally, the apparatus further comprises: a display module;

the display module is used for presenting the conference content.

Optionally, the apparatus further comprises: an export module;

the receiving module is specifically used for receiving configuration information of a user;

and the export module is used for exporting the conference content according to the file format indicated by the configuration information.

In a third aspect, the present application provides a recording device for conference content, the device comprising: a memory and a processor;

the memory is used for storing a computer program and transmitting the computer program to the processor;

the processor, according to instructions in the computer program, performs the method of any of the first aspects.

In a fourth aspect, the present application provides a computer readable storage medium for storing computer software instructions which, when run on a computer, cause the computer to perform the method of any of the first aspect above.

According to the technical scheme, the method has the following beneficial effects:

the application provides a conference content recording method, a conference content recording device, conference content recording equipment and a conference content recording storage medium. The method comprises the following steps: receiving a conference video; determining an action area of a target action in the conference video; determining the identification of the target action according to the face image information of the face area closest to the action area; and generating conference content according to the identification of the target action and the target expression information of the target action. In the method, an action area of a target action in a video image is determined through an image recognition technology, then a face area close to the action area is determined, face image information of the face area is obtained, and an identifier corresponding to the face image information, namely an identifier of the target action, is determined and is used for describing an executor of the target action. And recording the conference content by identifying the target action, determining expression information of the target action and corresponding the expression information of the target action with the executor of the target action. The method realizes the automatic recording of the conference content of the deaf-mute, reduces the workload of the conference recording personnel and improves the recording efficiency of the conference content.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a system architecture diagram of a recording system according to an embodiment of the present application;

fig. 2 is a schematic interface diagram of an interaction subsystem according to an embodiment of the present application;

fig. 3 is a schematic interface diagram of a conference video according to an embodiment of the present application;

fig. 4 is a schematic interface diagram of a conference video according to an embodiment of the present application;

fig. 5 is a flowchart of a method for recording meeting content according to an embodiment of the present application;

fig. 6 is a flowchart of another method for recording meeting content according to an embodiment of the present application;

fig. 7 is a schematic diagram of a device for recording conference content according to an embodiment of the present application;

fig. 8 is a schematic diagram of a computing device according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

For the sake of understanding, the technical terms related to the present application will be described below.

The conference content refers to the speaking information of the participants in the conference process, for example, the conference content includes the speaker and the speaking content of the speaker. In the related art, the voice signal of the speaker can be processed by a voice processing technique. The name of the speaker is determined by the voice signal, and the voice signal is converted into character information for recording. However, for a participant who is inconvenient to speak (e.g., a deaf-mute), the speaking information of the participant cannot be recorded by a sound signal. Therefore, a method for recording the conference content of the deaf-mute is needed in the art.

In view of the above, the present application provides a method for recording meeting content. The method may be implemented by a recording system. Specifically, the recording system receives a conference video and determines an action area of a target action in the conference video, where the target action can be an arm action (such as sign language or a gesture) or a lip action (such as lip language) of a participant. Then, the recording system determines the face area closest to the action area, and determines the identification of the target action according to the face image information of that face area. Finally, the recording system generates the conference content according to the identification of the target action and the target expression information of the target action, thereby recording the conference content.

The method comprises the steps of identifying an action area of a target action in a conference video through an image identification technology to determine a face area corresponding to the target action, and determining an identifier of the target action through face image information of the face area, wherein the identifier is used for identifying an executor of the target action. The method determines the target expression information of the target action through an image recognition technology, so that the character information corresponding to the target action can be automatically recognized. The method automatically corresponds the executor with the expression information, and further can generate the conference content corresponding to the conference video. Therefore, the method can realize the recording of the conference content of the deaf-mute.

The recording system may be a software system. In particular, the recording system may be deployed in a computing device in the form of computer software to implement functionality for recording meeting content. In some embodiments, the recording system may also be a hardware system. The hardware system includes a physical device having a function of recording conference content.

The recording system is realized by subsystems with different functions and units with different functions. The embodiment of the present application does not limit the partitioning manner of subsystems and units in the recording system, and the following description is made with reference to an exemplary partitioning manner shown in fig. 1.

As shown in fig. 1, recording system 100 includes an interaction subsystem 120 and a recording subsystem 140. The interactive subsystem 120 is used for providing a Graphical User Interface (GUI) to a user and receiving a conference video. The recording subsystem 140 is used for determining an action area of a target action in the conference video; determining the identification of the target action according to the face image information of the face area closest to the action area; and generating conference content according to the identification of the target action and the target expression information of the target action.

The interaction subsystem 120 includes a communication unit 122 and a display unit 124. Recording subsystem 140 includes communication unit 142, processing unit 144, and generation unit 146. Described separately below.

The communication unit 122 is used to receive the conference video, and the display unit 124 is configured to provide a GUI. Fig. 2 is an interface schematic diagram of the main interface of the interaction subsystem provided in an embodiment of the present application. As shown in fig. 2, the main interface 200 carries an upload control 220, through which the user may upload a conference video. The communication unit 122 obtains the conference video and then sends it to the recording subsystem 140. In some implementations, the main interface 200 also carries a capture control 260, through which the user can capture conference video in real time. The communication unit 122 then sends the captured conference video to the recording subsystem 140.

The communication unit 142 is used for receiving the conference video transmitted by the interaction subsystem 120. The processing unit 144 is configured to determine an action region of the target action in the conference video. For example, the processing unit 144 may use image recognition techniques to detect whether a target action occurs in the conference video, where the target action may be an arm action (e.g., sign language) or a lip action (e.g., lip language). When the processing unit 144 identifies that a target action is present in the conference video, it determines the action region of the target action in the conference video. Fig. 3 is a schematic diagram of an action region provided in an embodiment of the present application. Taking sign language as an example of the target action, as shown in fig. 3, a conference video interface 300 includes a plurality of participants 310. Upon identifying the target action 320, the processing unit 144 may determine the region 330 of the target action.

To further determine the performer of the target action, the processing unit 144 determines the identity of the target action based on the facial image information of the facial region closest to the action region. The identifier is used to describe the performer of the target action, and may be, for example, a name, a job number, or the like.

In some implementations, the processing unit 144 determines the center point of the action region and the center points of N face regions, where N ≥ 1 and N is an integer. For ease of understanding, take N greater than 1 as an example and refer to fig. 4, which is a schematic diagram of determining the face region closest to the action region according to an embodiment of the present application. The video interface 300 of the conference video includes N face regions 3401-340N, where the center point of the action region is M and the center points of the N face regions are F1-FN. The processing unit 144 determines the distance from the center point M of the action region to each of the center points F1-FN, and then determines the center point Fx of the face region with the smallest distance to M. For example, Fx may be F1.
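The nearest-face selection described above can be sketched as follows. This is only an illustrative sketch: the `Box` type, the pixel-coordinate representation of regions, the function name `nearest_face_index`, and the use of Euclidean distance are assumptions for illustration, since the application does not specify a region format or a distance metric.

```python
import math
from typing import NamedTuple, Sequence, Tuple


class Box(NamedTuple):
    """Axis-aligned region in pixel coordinates (hypothetical representation)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float

    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def nearest_face_index(action: Box, faces: Sequence[Box]) -> int:
    """Return the index of the face region whose center is closest to M."""
    if not faces:
        raise ValueError("at least one face region is required (N >= 1)")
    m = action.center()
    # Assumption: Euclidean distance between center points.
    return min(range(len(faces)), key=lambda i: math.dist(m, faces[i].center()))


action_region = Box(100, 200, 80, 60)    # center M = (140, 230)
face_regions = [
    Box(120, 100, 40, 40),               # center F1 = (140, 120)
    Box(400, 100, 40, 40),               # center F2 = (420, 120)
]
closest = nearest_face_index(action_region, face_regions)
```

With these made-up coordinates, F1 lies directly above M while F2 is far to the right, so the first face region is selected. When N is 1 the same function simply returns the only face region.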

After the processing unit 144 determines the center point F1 of the face region, the face image information of the face region is further obtained, and the identifier of the target action is determined according to the face image information. Specifically, the processing unit 144 may obtain a mapping relationship between the face image information of the participant and the identifier of the participant, which is stored in advance, from the database, then determine the identifier of the participant corresponding to the face image information of the face area according to the mapping relationship, and use the identifier of the participant as the identifier of the target action, that is, the target action is executed by the participant.
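The identifier lookup described above might look like the following sketch. The application only states that a pre-stored mapping between face image information and participant identifiers is read from a database; representing face image information as a small feature vector, matching by smallest Euclidean distance, the `max_distance` threshold, and all names and values below are assumptions made purely for illustration.

```python
import math
from typing import Optional, Sequence

# Hypothetical pre-stored mapping: face feature vector -> participant identifier.
# The vectors and identifiers are made-up illustration values.
FACE_DATABASE = [
    ((0.10, 0.80, 0.30), "Participant A"),
    ((0.90, 0.20, 0.50), "Participant B"),
]


def identify_performer(face_features: Sequence[float],
                       max_distance: float = 0.5) -> Optional[str]:
    """Return the identifier whose stored features are closest, or None."""
    stored, identifier = min(
        FACE_DATABASE, key=lambda entry: math.dist(entry[0], face_features)
    )
    if math.dist(stored, face_features) > max_distance:
        return None  # assumption: reject matches that are not similar enough
    return identifier


# The identifier of the matched participant is used as the identifier
# of the target action.
target_action_identifier = identify_performer((0.12, 0.79, 0.33))
```

The returned identifier could equally be a job number instead of a name; returning `None` for an unmatched face is a design choice of this sketch, not something the application prescribes.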

In another implementation, N is equal to 1, the processing unit 144 obtains face image information of a face region, and determines the identifier of the target action according to the face image information.

The generating unit 146 is configured to generate the conference content according to the identification of the target action and the target expression information of the target action. Specifically, the processing unit 144 is further configured to obtain a mapping relationship between actions and expression information from the database, and determine the target expression information corresponding to the target action according to the mapping relationship. Then, the generation unit 146 binds the identifier of the target action with the target expression information corresponding to the target action, thereby generating the conference content. The expression information may be the meaning that the target action expresses. Taking the target action 320 as an example, the expression information of the target action 320 is "us".
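The mapping and binding step described above can be sketched as follows. The action labels, the dictionary `ACTION_TO_EXPRESSION`, the `ContentEntry` type, and the function `generate_entry` are hypothetical names chosen for illustration; the application does not define how recognized actions are labeled or how a bound record is stored.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping between recognized action labels and expression information.
ACTION_TO_EXPRESSION = {
    "sign_us": "us",
    "sign_completed": "completed",
}


@dataclass(frozen=True)
class ContentEntry:
    identifier: str   # performer of the target action (name or job number)
    expression: str   # target expression information of the target action


def generate_entry(identifier: str, action_label: str) -> Optional[ContentEntry]:
    """Look up the expression information and bind it to the identifier."""
    expression = ACTION_TO_EXPRESSION.get(action_label)
    if expression is None:
        return None  # assumption: unknown actions are not recorded
    return ContentEntry(identifier, expression)


entry = generate_entry("Participant A", "sign_us")
```

A sequence of such entries, collected in the order the actions occur, would form the conference content for one video.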

In some implementations, after the generation unit 146 generates the conference content, the communication unit 142 sends the conference content to the interaction subsystem 120. The display unit 124 is also used to present the conference content through the GUI, so that the conference recording personnel can browse the conference content or correct any recording errors.

In some implementations, the communication unit 122 is further configured to receive configuration information of the user, the configuration information being used to describe the file format. Specifically, the user inputs configuration information through the GUI provided by the display unit 124, and the communication unit 122 transmits the configuration information to the recording subsystem 140.

After the communication unit 142 receives the configuration information sent by the interaction subsystem 120, the processing unit 144 exports the conference content according to the file format indicated by the configuration information. The user can therefore export the conference content in whatever file format suits the user's needs.

Next, in order to make the technical solution of the present application clearer and easier to understand, a detailed description will be given below of a recording method of conference content provided in an embodiment of the present application from the perspective of the recording system 100.

Referring to a flowchart of a recording method of conference contents shown in fig. 5, the method includes:

S501: recording system 100 receives conference video.

The recording system 100 may receive conference video through the GUI. The conference video may be a pre-recorded video or a video captured in real time. The participants and the target actions of the participants are recorded in the conference video. A participant expresses the content of his or her speech through a specific target action.

S502: recording system 100 determines an action region of the target action in the conference video.

The recording system 100 determines an action region of a target action in the conference video. For example, the recording system 100 may use image recognition technology to detect whether a target action occurs in the conference video, where the target action may be an arm action (e.g., sign language) or a lip action (e.g., lip language). When the recording system 100 identifies that a target action is present in the conference video, it determines the action region of the target action in the conference video. For the detailed process, refer to the description of fig. 3, which is not repeated here.

S503: the recording system 100 determines the identifier of the target action according to the face image information of the face region closest to the action region.

The recording system 100 determines the center point of the action area and the center points of the N face areas; wherein N is not less than 1, and N is an integer. When the value of N is different, the recording system has different processing procedures. The following is presented in two cases.

Case 1: N = 1.

The recording system 100 obtains first face image information of the single face region, and then determines the identifier of the target action according to the first face image information. Since only one face region exists in the conference video, it can be determined that the target action is made by the participant corresponding to the face image information of that face region. In this manner, the recording system 100 is able to determine the performer of the target action, i.e., which participant made the target action.

Case 2: N > 1.

The recording system 100 obtains the distance between the center point of the action area and the center point of each face area, determines the center point of the face area closest to the center point of the action area, and obtains second face image information corresponding to the center point of the face area closest to the center point of the action area; and determining the identifier of the target action according to the second face image information. For a specific process, reference may be made to the description of fig. 4, which is not described herein again.

Wherein the identification of the target action comprises the name of the target object or the job number of the target object.

S504: the recording system 100 generates the conference content according to the identifier of the target action and the target expression information of the target action.

Specifically, the recording system 100 obtains a mapping relationship between actions and expression information, determines the target expression information corresponding to the target action according to the mapping relationship, and binds the target expression information with the identifier of the target action, thereby generating the conference content.

The application provides a recording method of conference content, which comprises the following steps: receiving a conference video; determining an action area of a target action in the conference video; determining the identification of the target action according to the face image information of the face area closest to the action area; and generating conference content according to the identification of the target action and the target expression information of the target action. In the method, an action area of a target action in a video image is determined through an image recognition technology, then a face area close to the action area is determined, face image information of the face area is obtained, and an identifier corresponding to the face image information, namely an identifier of the target action, is determined and is used for describing an executor of the target action. And recording the conference content by identifying the target action, determining expression information of the target action and corresponding the expression information of the target action with the executor of the target action. The method realizes the automatic recording of the conference content of the deaf-mute, reduces the workload of the conference recording personnel and improves the recording efficiency of the conference content.

An embodiment of the present application further provides a method for recording meeting content, and referring to fig. 6, the method further includes, on the basis of S501 to S504:

S601: recording system 100 presents the meeting content.

The recording system 100 may present the meeting content through a GUI. Specifically, the conference contents may be as shown in table 1 below.

Table 1: Example of conference content

Participant A	How is the completion of our tasks from the previous stage?
Participant B	Basically completed.
Participant C	Not completed.
……	……

S602: recording system 100 exports the meeting content.

In some implementations, the recording system 100 can also receive configuration information for the user that specifies a file format for conference content export, thereby enabling the user to customize the export format of the conference content.

In other implementations, if the user does not input configuration information, the recording system 100 exports the conference content using default configuration information. For example, if the default configuration information indicates that the exported file format is a table, the recording system 100 exports the conference content as a table when the user does not input configuration information.
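The export behavior with user-supplied and default configuration information can be sketched as follows. The `file_format` key, the choice of CSV to stand in for the "table" format, the JSON alternative, and the function name `export_content` are assumptions for illustration; the application does not name concrete file formats or configuration fields.

```python
import csv
import io
import json
from typing import List, Optional, Tuple

# (identifier, expression information) pairs; illustration data only.
CONTENT: List[Tuple[str, str]] = [
    ("Participant B", "Basically completed."),
    ("Participant C", "Not completed."),
]

# Assumption: the default "table" format is represented here as CSV.
DEFAULT_CONFIG = {"file_format": "csv"}


def export_content(content: List[Tuple[str, str]],
                   config: Optional[dict] = None) -> str:
    """Serialize the conference content in the format named by the configuration."""
    # Fall back to the default configuration when the user supplied none.
    file_format = (config or DEFAULT_CONFIG).get("file_format", "csv")
    if file_format == "csv":
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["participant", "content"])
        writer.writerows(content)
        return buffer.getvalue()
    if file_format == "json":
        records = [{"participant": p, "content": c} for p, c in content]
        return json.dumps(records, ensure_ascii=False)
    raise ValueError(f"unsupported file format: {file_format}")


default_export = export_content(CONTENT)                          # no user configuration
json_export = export_content(CONTENT, {"file_format": "json"})    # user-selected format
```

Returning the serialized text rather than writing a file keeps the sketch free of file-system assumptions; a real system would write the result to a location of the user's choosing.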

It should be noted that S601 and S602 have no fixed execution order: S601 may be executed first, S602 may be executed first, or S601 and S602 may be executed simultaneously. The present embodiment does not limit the execution order of S601 and S602.

The method for recording the conference content provided by the embodiment of the present application is described above with reference to fig. 1 to 6, and the recording apparatus for the conference content and the computing device for implementing the function of the recording apparatus for the conference content provided by the embodiment of the present application are described next with reference to the accompanying drawings.

As shown in fig. 7, an embodiment of the present application further provides a device 700 for recording meeting content, where the device 700 is configured to execute the method for recording meeting content described above. The embodiment of the present application does not limit the division of the functional modules in the apparatus 700, and the following provides an exemplary division of the functional modules:

the recording apparatus 700 of the conference content comprises a receiving module 702, a processing module 704 and a generating module 706.

The receiving module 702 is configured to receive a conference video;

the processing module 704 is configured to determine an action region of a target action in the conference video; determining the identification of the target action according to the face image information of the face area closest to the action area;

the generating module 706 is configured to generate the conference content according to the identifier of the target action and the target expression information of the target action.

The recording apparatus 700 of the conference content may be implemented by a computing device. Fig. 8 provides a computing device, and as shown in fig. 8, a computing device 800 may be specifically used to implement the functions of the recording apparatus 700 for conference content in the embodiment shown in fig. 7.

Computing device 800 includes a bus 801, a processor 802, a display 803, and a memory 804. Communication between the processor 802, memory 804, and display 803 occurs via the bus 801.

The processor 802 may be any one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Micro Processor (MP), a Digital Signal Processor (DSP), and the like.

The display 803 is an input/output (I/O) device. The device can display electronic documents such as images and characters on a screen for a user to view. The display 803 may be classified into a Liquid Crystal Display (LCD), an Organic Light Emitting Diode (OLED) display, and the like according to a manufacturing material. In particular, the display 803 may receive conference video through the GUI.

The memory 804 may include volatile memory (volatile memory), such as Random Access Memory (RAM). The memory 804 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, a hard drive (HDD) or a Solid State Drive (SSD).

The memory 804 stores executable program codes, and the processor 802 executes the executable program codes to perform the aforementioned recording method of the conference contents. In particular, the processor 802 executes the program code described above to control the display 803 to receive conference video via the GUI. The processor 802 then determines an action region of the target action in the conference video; determining the identification of the target action according to the face image information of the face area closest to the action area; and generating conference content according to the identification of the target action and the target expression information of the target action.

The embodiment of the application also provides a computer readable storage medium. The computer-readable storage medium can be any available medium that a computing device can store or a data storage device, such as a data center, that contains one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), among others. The computer-readable storage medium includes instructions that instruct a computing device to perform the above-described recording method of conference content.

The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment. The system embodiments described above are merely illustrative: the units and modules described as separate components may or may not be physically separate, and some or all of them may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement this without inventive effort.

It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects before and after it. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.
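The enumeration of "at least one of a, b, or c" can be checked with a short Python sketch that lists every non-empty combination of the items. The function name `at_least_one_of` is hypothetical and is used only for illustration.

```python
from itertools import combinations


def at_least_one_of(items):
    """Return every non-empty combination of items, smallest combinations first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


result = at_least_one_of(["a", "b", "c"])
for combo in result:
    print(" and ".join(combo))
```

For three items this yields seven combinations, in the same order as the seven cases listed in the paragraph above.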

The foregoing is merely a preferred embodiment of the present application and is not intended to limit the present application in any way. Although the present application has been disclosed above with reference to the preferred embodiments, they are not intended to limit it. Those skilled in the art may, without departing from the scope of the claimed embodiments, use the methods and technical content disclosed above to make many possible variations and modifications to the disclosed embodiments, or modify them into equivalent embodiments. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present application, without departing from the content of its technical solution, still falls within the protection scope of the technical solution of the present application.

Claims

1. A method for recording conference content, comprising:

receiving a conference video;

determining an action area of a target action in the conference video;

2. The method of claim 1, wherein the determining the identity of the target action according to the face image information of the face region closest to the action region comprises:

3. The method according to claim 1 or 2, wherein the generating conference content according to the identification of the target action and the target expression information of the target action comprises:

acquiring a mapping relation between actions and expression information;

4. The method of any of claims 1 to 3, wherein the identification of the target action comprises a name of the target object or a job number of the target object.

5. The method according to any one of claims 1 to 4, wherein the target action comprises an arm action and/or a lip action.

6. The method according to any one of claims 1 to 5, further comprising:

presenting the conference content.

7. The method according to any one of claims 1 to 6, further comprising:

receiving configuration information of a user;

8. An apparatus for recording conference content, comprising: a receiving module, a processing module and a generating module;

the receiving module is used for receiving the conference video;

9. An apparatus for recording conference content, the apparatus comprising: a memory and a processor;

the processor is configured to execute the method of any one of claims 1 to 7 according to instructions in the computer program.

10. A computer readable storage medium for storing computer software instructions which, when run on a computer, cause the computer to perform the method of any of claims 1 to 7.