CN111193890A - Conference record analyzing device and method and conference record playing system - Google Patents
- Publication number: CN111193890A
- Application number: CN201811353598.7A
- Authority: CN (China)
- Prior art keywords: speech, face, image, speaking, conference record
- Prior art date: 2018-11-14
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/76—Television signal recording
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/18—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
- H04N7/183—Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a single remote source
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Human Computer Interaction (AREA)
- Telephonic Communication Services (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
The invention provides a conference record playing system that includes a conference record analysis device for processing a conference record and a conference record playing device for playing the processed conference record so that viewing users can watch it. The conference record analysis device has a voice conversion part that transcribes the recorded speech sound into the speech text of each participant, forming text information that lets the user conveniently review what was said; it further has a word correspondence processing part that associates each piece of speech text with the participant who spoke it, so that the viewing user knows who is speaking while reading the text. The conference record playing device has a playing control part: when the user clicks a piece of speech text, the image playing part and the sound playing part are controlled so that the current playing time of the conference record jumps to the speaking time of the clicked text.
Description
Technical Field
The invention relates to a conference record analysis device, a corresponding conference record analysis method and a conference record playing system comprising the conference record analysis device.
Background
A conference record documents the discussions of the participants in a meeting. It provides an accurate basis for the meeting's information and allows the relevant personnel to review it later, so that the content discussed in the meeting is not lost or forgotten.
In the past, the common way of keeping a conference record was manual note-taking, that is, a designated recorder wrote down the information of the meeting in text.
The drawback of this method is that it lacks detail and intuitiveness, and important information such as blackboard writing or a projection screen shown in the meeting is easily lost, so such a record usually cannot restore the information of the whole meeting.
To overcome the above shortcomings and record meeting information more vividly, a video-based conference recording approach has appeared in the prior art, in which the speaking sound and speaking images of the participants are recorded to preserve the meeting information. However, with this approach, when a user watching the recording needs to find specific content (for example, the remarks of a particular person, or the remarks of the participants on a particular topic), the user has to browse the entire video to locate it, which costs the viewing user a great deal of time and effort and reduces work efficiency.
Disclosure of Invention
The present invention has been made to solve the above problems, and an object of the present invention is to provide a conference record analysis device that allows the speech information of the participants in the recording to be identified through text, without requiring the user to browse the entire recording.
In order to achieve the purpose, the invention adopts the following technical scheme:
the present invention provides a conference record analysis device for analyzing a conference record including a panoramic image and a speech sound recorded by panoramic imaging to acquire a speech record, the conference record analysis device including: the conference recording storage part stores conference records, the face recognition part analyzes panoramic images in the conference records to acquire different face characteristics of each participant and recognize the face of each participant, the voice conversion part converts speech sounds into corresponding speech words according to time, the word correspondence processing part corresponds the speech words to the corresponding participants according to the speech time of the participants, and the speech record storage part correspondingly stores the speech words, the speech time and the corresponding participants.
The present invention also provides a conference record analysis method for analyzing a conference record, which is recorded by panoramic imaging and contains a panoramic image and speech sound, so as to acquire a speech record, the method comprising the following steps: a conference record storage step of storing the conference record; a face recognition step of analyzing the panoramic image in the conference record to acquire the distinct facial features of each participant and recognize each participant's face; a voice conversion step of converting the speech sound into corresponding speech text according to time; a word correspondence processing step of associating the speech text with the corresponding participant according to the participant's speaking time; and a speech record storage step of storing the speech text, the speaking time and the corresponding participant in association with one another.
Action and Effect of the Invention
According to the conference record analysis device and method, the voice conversion part transcribes the recorded speech and generates the speech text and speaking time of each participant, so that text information is formed and the user can conveniently review the content of the speech sound. The word correspondence processing part associates the speech text with the participant who spoke, so that the faces of the participants in the conference record correspond to their remarks. Therefore, in the processed conference record, while watching the recorded images and speech sound, the viewing user can also directly browse all the text information and the corresponding speakers, making it easy to quickly understand and query the whole conference record. Meanwhile, the recorded image captured by the panoramic camera can restore the conference scene to a greater extent.
Drawings
Fig. 1 is a schematic diagram of a configuration of a conference record playing system in an embodiment of the present invention;
fig. 2 is a schematic configuration diagram of a conference record analysis apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a word correspondence processing unit according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a configuration of a conference recording and playing apparatus according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a record playing screen according to an embodiment of the present invention;
FIG. 6 is a flow diagram of a meeting record parsing process in an embodiment of the invention; and
fig. 7 is a flowchart of an automatic adjustment process of a playing angle of view according to an embodiment of the present invention.
Detailed Description
To make the technical means, creative features, objects and effects of the present invention easy to understand, the conference record playing system of the present invention is described in detail below with reference to the embodiments and the accompanying drawings.
The present invention provides a conference record analysis device for analyzing a conference record, which is recorded by panoramic imaging and contains a panoramic image and speech sound, so as to acquire a speech record. The conference record analysis device includes a conference record storage part, a face recognition part, a voice conversion part, a word correspondence processing part and a speech record storage part. The conference record storage part stores the conference record; the face recognition part analyzes the panoramic image in the conference record to acquire the distinct facial features of each participant and recognize each participant's face; the voice conversion part converts the speech sound into corresponding speech text according to time; the word correspondence processing part associates the speech text with the corresponding participant according to the participant's speaking time; and the speech record storage part stores the speech text, the speaking time and the corresponding participant in association with one another.
As a first aspect, the conference record analysis device provided by the present invention may further include an identification information assigning part and a face storage part, wherein the identification information assigning part sets a corresponding number of pieces of identification information according to the number of participants recognized by the face recognition part and assigns one to each participant, the face storage part stores the identification information and the participants' faces in association with each other, and the speech record storage part stores the speech text, the speaking time and the identification information of the corresponding participant in association with one another.
In the first aspect, the present invention may further include an organization personnel management database and a face retrieval part, wherein the organization personnel management database stores at least the identification information of each organization member and the corresponding face image, the face retrieval part retrieves the organization personnel management database according to the participants' faces recognized by the face recognition part to obtain the identification information of each participant, and the speech record storage part stores the speech text, the speaking time and the identification information of the corresponding participant in association with one another.
In the first aspect, the present invention may further have the following features: wherein, the word correspondence processing part comprises: the voice recognition unit analyzes the speech sound in the conference record to obtain the voiceprint characteristics of each participant, so that the voiceprints of different participants are recognized; the voiceprint storage unit is used for correspondingly storing the voiceprint and the corresponding participant; a speech time division unit which divides each speech in speech sound into different speech sound parts according to speech pauses in the speech characters; and the identification corresponding unit is used for sequentially identifying the voiceprints in the speech sound parts according to the voiceprint storage unit and judging the corresponding participants from the voiceprint storage unit according to the speech voiceprints.
In the first aspect, the present invention may further have the following features: wherein, the word correspondence processing part comprises: a speech image part dividing unit which divides each speech in the panoramic image into different speech image parts according to speech pauses in the speech texts; the in-image-part face recognition unit intercepts the face image in the speech image part and recognizes each face as the face in the image part; a mouth shape conversion judging unit which respectively judges mouth shape conversion of the human face in each image part in the speaking image part so as to obtain the mouth shape conversion times of the participant corresponding to the human face in each image part in the speaking time; a speaker determination unit that counts the number of times of mouth shape conversion of the face in each of the image portions of the speech image portion and determines the face in the image portion having the largest number of times of mouth shape conversion as the speech face in the speech image portion; and the identification corresponding unit is used for sequentially judging the participants corresponding to the speaking time according to the speaking faces in the speaking image parts.
As a second aspect, the present invention provides a conference record playing system for processing a conference record, which is recorded by panoramic imaging and contains a panoramic image and speech sound, and playing the processed conference record. The system includes a conference record analysis device for processing the conference record and a conference record playing device for playing the processed record so that viewing users can watch it. The conference record analysis device has a conference record storage part, a face recognition part, a voice conversion part, a word correspondence processing part and a speech record storage part: the conference record storage part stores the conference record; the face recognition part analyzes the panoramic image in the conference record to acquire the distinct facial features of each participant and recognize each participant's face; the voice conversion part converts the speech sound into corresponding speech text according to time; the word correspondence processing part associates the speech text with the corresponding participant according to the participant's speaking time; and the speech record storage part stores the speech text, the speaking time and the corresponding participant in association with one another. The conference record playing device has a picture storage part, a playing control part, an input display part, an image playing part and a sound playing part. The picture storage part stores a record playing picture that has an image display portion and a text display portion. The playing control part controls the input display part to display the record playing picture, controls the image playing part to play the panoramic image in the image display portion, controls the sound playing part to play the speech sound, and further displays the speech text and speaking time in the text display portion according to the speaking time so that the viewing user can click on the speech text. Once the viewing user clicks a piece of speech text, the playing control part controls the panoramic image and speech sound played by the image playing part and the sound playing part according to the speaking time corresponding to the clicked text, so that the current playing time of the panoramic image and the speech sound corresponds to that speaking time.
In the second aspect, the following features may further be provided: the conference record playing device also has a time progress retrieval part and a face coordinate acquisition part, and the image playing part is a panoramic image playing part that plays the panoramic image according to different viewing angles. At each preset refresh time, the playing control part controls the time progress retrieval part to retrieve, according to the current playing time, the speaking time and the corresponding participant from the speech record storage part, controls the face coordinate acquisition part to analyze the panoramic image in the conference record according to that participant to acquire the coordinate information corresponding to the participant's face, and further controls the panoramic image playing part to adjust the center of the playing viewing angle in the panoramic image to the viewing angle centered on the coordinate information corresponding to the speaking face.
The invention provides a conference record analysis method for analyzing a conference record which is recorded by panoramic shooting and contains a panoramic image and speech sound so as to obtain a speech record, which is characterized by comprising the following steps: a conference record storage step of storing the conference record; a face recognition step, namely analyzing the panoramic image in the conference record to obtain different face characteristics of each participant and recognizing the face of each participant; a voice conversion step of converting the speech sound into corresponding speech words according to time; a word corresponding processing step, wherein the speaking words are corresponding to the corresponding participants according to the speaking time of the participants; and a speech record storage step, in which the speech words, the speech time and the corresponding participants are correspondingly stored.
<Example>
Fig. 1 is a schematic diagram of a configuration of a conference recording and playing system in an embodiment of the present invention.
As shown in fig. 1, the conference record playback system 100 of the present embodiment includes a conference record analysis device 1, a conference record playback device 2, and a communication network 3.
The conference record analysis apparatus 1 and the conference record playback apparatus 2 are communicatively connected via a communication network 3. The conference record analysis device 1 is configured to process a conference record, and the conference record playing device 2 is configured to play the conference record processed by the conference record analysis device 1.
Fig. 2 is a schematic configuration diagram of a conference record analysis apparatus in an embodiment of the present invention.
As shown in fig. 2, the conference record analysis device 1 includes a conference record storage unit 11, a face recognition unit 12, an organization personnel management database 13, a face search unit 14, a voice conversion unit 15, a word correspondence processing unit 16, an utterance record storage unit 17, an analysis-side communication unit 18, and an analysis-side control unit 19.
Among them, the analysis-side communication unit 18 exchanges data between the respective components of the conference record analysis apparatus 1 and between the conference record analysis apparatus 1 and another apparatus, and the analysis-side control unit 19 controls operations of the respective components of the conference record analysis apparatus 1.
The conference record storage unit 11 is used to store a conference record including a panoramic image and a speech sound recorded by a panoramic recording device (for example, a panoramic camera). The panoramic image is a conference video with a 360-degree field of view that can be flattened and unfolded by a panoramic video unfolding algorithm (for example, the prior-art three.js library, which can unfold the panoramic video into a three-dimensional scene in a browser); the speech sound is audio recording the speech of the participants in the conference, and the time axes of the panoramic image and the speech sound correspond to each other.
The face recognition unit 12 is configured to analyze the panoramic image in the meeting record to obtain different face features of each participant, and recognize a face of each participant.
In this embodiment, the face recognition unit 12 obtains each face feature and recognizes a face in the panoramic image according to the face feature, thereby confirming each different participant. In other embodiments, the face recognition unit 12 may also record the coordinate range of the face so as to cooperate with face feature recognition to identify and confirm the participants sitting at different positions.
The organization personnel management database 13 stores at least identification information of each organization personnel and corresponding face images. In this embodiment, the identification information is the employee number of the organization personnel, the organization personnel management database 13 further stores names (corresponding to the employee number, respectively) of the organization personnel, and the face image is obtained by acquiring the face image of each organization personnel.
The face retrieval unit 14 retrieves the organization personnel management database 13 from the faces of the participants identified by the face identification unit 12 to obtain identification information of each participant. The speech conversion section 15 converts speech sound into corresponding speech characters according to time.
In this embodiment, the speech conversion unit 15 first acquires the speech sound (i.e., the audio portion in the conference recording) from the conference recording storage unit 11, then segments the speech sound according to the speech time and the speech pause in the speech sound (e.g., the pause between sentences spoken by participants in the audio portion), records the start time and the end time of each speech segment as the speech time, and further converts the speech in each speech segment into speech words by speech conversion.
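The pause-based segmentation and per-segment transcription described above can be illustrated with a short sketch. The following Python fragment is only an illustrative sketch, not the patented implementation: it assumes the pydub library for silence-based splitting, and transcribe() is a hypothetical stand-in for whatever speech-to-text engine or external service performs the actual conversion.

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def split_and_transcribe(audio_path, transcribe, min_pause_ms=700, silence_thresh_db=-40):
    """Split conference audio at speech pauses and transcribe each segment.

    `transcribe` is a hypothetical callable (AudioSegment -> str) wrapping the
    actual speech-to-text engine; it is not defined here.
    """
    audio = AudioSegment.from_file(audio_path)
    # detect_nonsilent returns [start_ms, end_ms] spans of speech between pauses
    spans = detect_nonsilent(audio, min_silence_len=min_pause_ms,
                             silence_thresh=silence_thresh_db)
    utterances = []
    for start_ms, end_ms in spans:
        segment = audio[start_ms:end_ms]
        utterances.append({
            "start_s": start_ms / 1000.0,   # speaking time: segment start
            "end_s": end_ms / 1000.0,       # speaking time: segment end
            "text": transcribe(segment),    # speech text for this segment
        })
    return utterances
```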
In another embodiment, the speech conversion unit 15 may also generate the speech text by calling, through the analysis-side communication unit 18, an external speech-to-text resource (for example, an open-source speech recognition web service) to convert the speech in each segment.
The word correspondence processing unit 16 corresponds the utterance word to the participant according to the utterance time of the participant.
FIG. 3 is a block diagram of a word correspondence processing unit according to an embodiment of the present invention.
As shown in fig. 3, the character correspondence processing unit 16 is a processing unit that performs correspondence between uttered characters and uttered participants based on a panoramic image, and includes utterance image portion dividing means 161, in-image-portion face recognition means 162, mouth shape conversion determination means 163, utterer determination means 164, recognition correspondence means 165, and correspondence control means 166.
The correspondence control unit 166 controls the operations of the respective components of the character correspondence processing unit 16.
The utterance image division unit 161 is configured to divide each utterance in the panoramic image into different utterance image portions according to utterance pauses in the utterance text, where each utterance image portion has time information (i.e., start and stop times) and a panoramic image including faces of participants.
The in-image-section face recognition unit 162 cuts out the face images in the utterance image section divided by the utterance image section dividing unit 161 and recognizes each face as a face in the image section. In this embodiment, the in-image-section face recognition unit 162 collects image frames in a speech image section according to the time information of the speech image section and a predetermined interval sampling time, cuts the faces of the image frames, and divides the same faces in all the image frames into one group, so as to obtain a plurality of groups of face image frames corresponding to the participants respectively. In addition, a single human face image frame in each group of human face image frames also comprises interception time point information of the corresponding image frame.
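As a rough illustration of this frame-sampling and face-grouping step, the sketch below samples frames at a fixed interval, crops the detected faces, and groups crops that appear to belong to the same participant. The use of OpenCV and the face_recognition library is an assumption made for the sketch only; the patent does not specify these tools.

```python
import cv2
import face_recognition

def group_faces_in_segment(video_path, start_s, end_s, sample_interval_s=1.0, tol=0.5):
    """Sample frames of one utterance segment, crop faces, and group identical faces.

    Returns a list of groups; each group holds (timestamp, face_crop, encoding)
    tuples that appear to belong to the same participant.
    """
    cap = cv2.VideoCapture(video_path)
    groups = []           # one entry per distinct face
    group_encodings = []  # reference encoding for each group
    t = start_s
    while t < end_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        boxes = face_recognition.face_locations(rgb)
        encodings = face_recognition.face_encodings(rgb, boxes)
        for (top, right, bottom, left), enc in zip(boxes, encodings):
            crop = rgb[top:bottom, left:right]
            matches = face_recognition.compare_faces(group_encodings, enc, tolerance=tol)
            if True in matches:
                groups[matches.index(True)].append((t, crop, enc))
            else:
                group_encodings.append(enc)
                groups.append([(t, crop, enc)])
        t += sample_interval_s
    cap.release()
    return groups
```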
The mouth shape conversion determination unit 163 determines the mouth shape conversion of each group of face image frames in the utterance image section, respectively, so as to obtain the number of mouth shape conversion times of the participant within the utterance time for the participant corresponding to each group of face image frames.
In this embodiment, the mouth shape transformation determining unit 163 sequentially analyzes the feature points representing the upper and lower lips in each face image frame of each group by a facial feature detection algorithm (for example, the prior-art Dlib algorithm). If, between two consecutive face image frames in a group, the vertical distance between the upper-lip and lower-lip feature points changes by more than 1/2 of that participant's lip height, one mouth opening is counted. All mouth openings in each group of face image frames are then counted in turn, giving the number of mouth shape transformations for each group.
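A minimal sketch of this lip-landmark analysis is given below, using the Dlib 68-point landmark model mentioned above. The landmark indices, the model file path, and the reading of "1/2 of the lips" as half the lip height are assumptions made for illustration.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The path to the standard 68-landmark model file is an assumption; it must be downloaded separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_gap_and_height(gray_face):
    """Return (inner-lip gap, lip height) for the first detected face, or None."""
    rects = detector(gray_face, 1)
    if not rects:
        return None
    shape = predictor(gray_face, rects[0])
    pts = np.array([[p.x, p.y] for p in shape.parts()])
    upper_inner, lower_inner = pts[62], pts[66]   # inner-lip midpoints
    upper_outer, lower_outer = pts[51], pts[57]   # outer-lip midpoints
    gap = abs(lower_inner[1] - upper_inner[1])
    lip_height = abs(lower_outer[1] - upper_outer[1])
    return gap, lip_height

def count_mouth_openings(face_frames_gray):
    """Count mouth-shape transformations across consecutive face frames of one participant."""
    openings, prev_gap = 0, None
    for frame in face_frames_gray:
        result = lip_gap_and_height(frame)
        if result is None:
            continue
        gap, lip_height = result
        # Assumption: a transformation is counted when the inner-lip gap changes
        # by more than half the lip height between consecutive sampled frames.
        if prev_gap is not None and abs(gap - prev_gap) > 0.5 * lip_height:
            openings += 1
        prev_gap = gap
    return openings
```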
The speaker determination unit 164 counts the number of times of mouth shape conversion of each group of face image frames in each speech image portion, and determines the face corresponding to the group of face image frames with the largest number of times of mouth shape conversion as the speech face in the speech image portion.
The recognition correspondence unit 165 sequentially determines participants corresponding to the respective speaking times based on the faces of the speakers determined by the speaker determination unit 164.
The utterance record storage unit 17 stores the utterance characters, the utterance time, and the identification information of the relevant participant in association with each other.
Fig. 4 is a block diagram of a structure of a conference recording and playing apparatus in an embodiment of the present invention.
As shown in fig. 4, the conference record playback device 2 includes a screen storage unit 21, a playback control unit 22, an input display unit 23, an image playback unit 24, a sound playback unit 25, a schedule retrieval unit 26, a face coordinate acquisition unit 27, a playback-side communication unit 28, and a playback-side control unit 29.
Among them, the broadcast-side communication unit 28 exchanges data between the respective components of the conference recording/reproducing device 2 and between the conference recording/reproducing device 2 and another device, and the broadcast-side control unit 29 controls the operations of the respective components of the conference recording/reproducing device 2.
The screen storage unit 21 stores a recording/playback screen. As shown in fig. 5, the record playback screen has an image display portion and a text display portion for displaying a panoramic image of a conference record and all the speech texts in the conference record when a viewing user selects a conference record to view.
The playing control unit 22 is used for controlling the operations of the components related to the recording and playing process in the conference recording and playing device 2, including the operations related to the recording and playing process of the input display unit 23, the image playing unit 24, the sound playing unit 25, the time schedule retrieving unit 26, and the face coordinate obtaining unit 27.
Specifically, when the viewing user selects one conference record to play, the playback control section 22 controls the input display section 23 to display a record playback screen, controls the image playback section 24 to play a panoramic image in the image display section, controls the sound playback section 25 to play a speech sound in synchronization with the played panoramic image, and further displays speech characters and a speech time in the character display section according to the speech time to allow the viewing user to click the speech characters.
In this embodiment, the text display portion is a scrollable text box, and all the spoken texts recorded in the currently played conference are displayed in the text display portion and are correspondingly scrolled along with the recorded playing time, so that the spoken texts at the current time are always located in the middle of the text display portion. The viewing user can browse the speech letters by dragging the scroll bar in the letter display section through the input display section 23.
The text display section displays the speech text, and displays the corresponding speech time, the face image of the speaking participant (acquired from the organization personnel management database 13), and the name of the speaking participant (acquired from the organization personnel management database 13) near each speech text.
When the viewing user clicks a certain utterance text, the playback control unit 22 controls the playback progress of the panoramic image and the utterance sound played by the image playback unit 24 and the sound playback unit 25 based on the utterance time corresponding to the clicked utterance text, so that the current playback time of the panoramic image and the utterance sound is shifted to the clicked utterance text speaking time.
In the present embodiment, the image playback section 24 is a panoramic image playback section that plays a panoramic image according to different viewing angles, and the panoramic image playback section is capable of displaying a partial image of the panoramic image according to a playback viewing angle in the image display section.
While the image playback section 24 plays the panoramic image, the playback control section 22, at each preset refresh time, controls the schedule retrieval section 26 to retrieve from the utterance record storage section 17 the utterance time and corresponding participant matching the current playback time, taking that participant as the speaking participant.
When the schedule retrieval unit 26 retrieves and acquires the speaking participant, the playback control unit 22 controls the face coordinate acquisition unit 27 to analyze the currently played panoramic image based on the acquired speaking participant, thereby acquiring coordinate information of the face of the speaking participant.
When the face coordinate acquiring unit 27 acquires the coordinate information, the playback control unit 22 controls the panoramic image playback unit to adjust the center of the playback angle in the panoramic image to the playback angle centered on the coordinate information, based on the coordinate information corresponding to the face of the speaker.
In this embodiment, the play control unit 22 has a timing unit for timing the preset refresh time, which is set to 0.5 s. While the image playing part 24 plays the panoramic image, the playing control part 22 controls the panoramic image playing part to adjust the playing angle of view every 0.5 s, so that the speaking participant always remains at the center of the image display portion in the record playing screen.
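A possible sketch of this periodic view-angle adjustment is shown below. It assumes the panoramic frame uses an equirectangular mapping, and the player object with its set_view() method and the get_speaker_face_xy() callback are hypothetical stand-ins for the panoramic image playing part and the face coordinate acquisition part.

```python
import time

def pixel_to_view_angle(x, y, pano_width, pano_height):
    """Convert a face's pixel coordinates in an equirectangular panorama to a
    (yaw, pitch) view-angle center in degrees (assumed frame mapping)."""
    yaw = (x / pano_width) * 360.0 - 180.0      # -180 .. 180 degrees
    pitch = 90.0 - (y / pano_height) * 180.0    # 90 (top) .. -90 (bottom)
    return yaw, pitch

def follow_speaker(player, get_speaker_face_xy, pano_size, refresh_s=0.5):
    """Every refresh_s seconds, re-center the panorama view on the speaking face.

    `player.set_view(yaw, pitch)` and `get_speaker_face_xy(t)` are hypothetical
    hooks into the playback device and the face-coordinate acquisition part.
    """
    width, height = pano_size
    while player.is_playing():
        xy = get_speaker_face_xy(player.current_time())
        if xy is not None:
            yaw, pitch = pixel_to_view_angle(xy[0], xy[1], width, height)
            player.set_view(yaw, pitch)
        time.sleep(refresh_s)
```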
In other embodiments, the viewing user can adjust the playing angle of the panoramic image played by the panoramic image playing unit through the input display unit 23 (for example, by dragging with a mouse, the playing angle of the panoramic image is directly adjusted to the angle corresponding to the projection screen or the conference book to be viewed).
Fig. 6 is a flowchart of a process of parsing a conference record in an embodiment of the present invention.
As shown in fig. 6, the conference record analysis process is a process in which the conference record analysis device 1 analyzes a conference record.
When a new conference record is stored in the conference record storage unit 11, the conference record analysis device 1 analyzes the conference record, and then starts the following steps:
step S1-1, the face recognition part 12 obtains the panoramic image stored in the conference record storage part 11, analyzes the panoramic image to recognize different participants according to the face characteristics of different faces, and then enters step S1-2;
step S1-2, the voice converting part 15 obtains the speaking voice stored in the conference recording storing part 11, converts the speaking voice into speaking characters according to time and records the corresponding speaking time, and then the step S1-3 is proceeded;
step S1-3, the word correspondence processing part corresponds the speaking word converted in the step S1-2 with the participant according to the corresponding speaking time, and then the step S1-4 is carried out;
step S1-4, the face retrieval part 14 retrieves the organization personnel management database 13 according to the faces recognized in step S1-1 to obtain the identification information of the participants, and then step S1-5 is performed;
and step S1-5, the speech text converted in step S1-2, the corresponding speaking time and the identification information of the participants acquired in step S1-4 are stored in association with one another, and the process ends (an illustrative sketch of this pipeline is given below).
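The sketch below chains steps S1-1 to S1-5 into one pipeline. Every callable passed in is a hypothetical stand-in for the corresponding unit described above; none of these function names come from the patent itself.

```python
def parse_conference_record(record, face_db, recognize_faces, split_and_transcribe,
                            match_speaker, lookup_identity):
    """Illustrative pipeline for steps S1-1 to S1-5; each callable is a
    hypothetical stand-in for the corresponding unit described above."""
    # S1-1: recognize the distinct participants from the panoramic image
    participants = recognize_faces(record["panoramic_video"])
    # S1-2: split the audio at speech pauses and transcribe each segment
    utterances = split_and_transcribe(record["speech_audio"])
    speech_records = []
    for utt in utterances:
        # S1-3: associate the utterance with the participant speaking in that period
        speaker = match_speaker(record["panoramic_video"], participants,
                                utt["start_s"], utt["end_s"])
        # S1-4: look up the speaker's identification information in the personnel database
        identity = lookup_identity(face_db, speaker)
        # S1-5: store speech text, speaking time and identification information together
        speech_records.append({"text": utt["text"],
                               "start_s": utt["start_s"],
                               "end_s": utt["end_s"],
                               "speaker_id": identity})
    return speech_records
```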
Fig. 7 is a flowchart of an automatic adjustment process of a playing angle of view according to an embodiment of the present invention.
As shown in fig. 7, in the process of playing the conference record by the conference record playing apparatus 2, the playing control unit 22 controls the panoramic image playing unit to adjust the playing angle according to the preset refresh time, and the steps are as follows:
step S2-1, the timing unit of the playing control part 22 detects the refresh of the preset refresh time, and then the step S2-2 is proceeded;
step S2-2, the play control section 22 controls the schedule retrieval section 26 to retrieve the utterance time and the corresponding participant in the utterance record storage section 17 according to the current play time, and then proceeds to step S2-3;
step S2-3, the playback control unit 22 controls the face coordinate acquisition unit 27 to analyze the panoramic image in the conference record according to the attendee acquired in step S2-2 to acquire coordinate information corresponding to the face of the attendee, and then step S2-4 is performed;
in step S2-4, the playback control unit 22 controls the panoramic image playback unit to adjust the center of the playback angle of view in the panoramic image to the playback angle of view centered on the coordinate information based on the coordinate information acquired in step S2-3, and the process ends.
Action and Effect of the Embodiment
According to the conference record analysis device and method provided by this embodiment, the voice conversion part transcribes the recorded speech and generates the speech text and speaking time of each participant, so that text information is formed and the user can conveniently review the content of the speech sound. The conference record analysis device thus automatically organizes the panoramic video conference record, allowing the viewing user to directly browse all the text information and the corresponding speakers without manual transcription, and to quickly browse and filter the content of the conference record. Meanwhile, the recorded image captured by the panoramic camera restores the conference scene to a greater extent, giving viewing users a stronger sense of presence.
In the embodiment, since the organization personnel management database is provided, the face retrieval part retrieves the database to obtain the identification information of the participants, and the speaker's name is displayed beside the speech text so that the person watching the video can identify the speaker.
in an embodiment, the word correspondence processing unit is capable of judging the face characteristics of the participants to correspond the speech words to the speakers, the mouth shape conversion judging unit is used for counting the number of times of mouth shape conversion of all the participants in each speech time period, so that the speaker judging unit is used for judging the speech face with the largest speech in each speech time period, and the identification corresponding unit is used for corresponding the participants to the speech words and the speech time according to the speech face and the corresponding time period.
In the conference record playing system provided by this embodiment, since the conference record playing device is provided to play the conference record processed by the conference record analyzing device, and the recording and playing screen has the text display portion, which can display all the speech texts of the current conference record, the user can visually know the information in the whole conference record through the speech texts. Meanwhile, the playing control part can control the playing progress of the video to jump according to the words of speaking clicked by the watching user, so that the user can find the corresponding image and audio contents through the words, and the browsing and screening of the conference records can be completed more conveniently and rapidly.
In the embodiment, the playing control part adjusts the playing angle of view of the panoramic image at regular intervals: the coordinate information corresponding to the speaking participant's face is acquired by the face coordinate acquisition part, and the playing control part controls the panoramic image playing part to adjust the center of the playing angle of view to the angle centered on that coordinate information. The angle of view is thus adjusted to the speaking participant in real time in the record playing screen, so that the viewing user can intuitively see which participant is speaking, has a stronger sense of presence, and obtains a better viewing experience of the conference record.
<Modification Example 1>
Compared with the embodiment, the conference record analysis device 1 according to the first modification does not include the organization personnel management database 13 and the face search unit 14, but instead includes an identification information assigning unit and a face storage unit. In this case, the conference record analysis device 1 identifies the participants by means of the identification information assigning unit and the face storage unit, and the procedure is roughly as follows.
The identification information assigning unit sets a corresponding number of pieces of identification information according to the number of participants recognized by the face recognition unit 12 and assigns one to each participant.
The face storage unit stores the identification information set by the identification information assigning unit and the participants' faces in association with each other.
The utterance record storage unit 17 stores the utterance text, the utterance time, and the identification information of the corresponding participant set by the identification information assigning unit in association with each other.
When the playback control unit 22 controls the text display unit to display the speech text, the text display unit displays the speech text, the speech time corresponding to the speech text, and the face image of the speaking participant (acquired from the face storage unit).
In the first modification, since the identification information assigning unit assigns identification information to each participant according to the number of participants, temporary identification information can be attached to each participant even when the participants' information is unknown. The face stored in the face storage unit can then be used as an avatar identifying the participant, and it is displayed beside the speech text of the conference record so that the person watching the video can identify who is speaking.
<Modification Example 2>
Compared with the embodiment, the word correspondence processing unit 16 according to the second modification associates the speech text with the speaking participants based on a joint analysis of the panoramic image and the voiceprints. In this case, the word correspondence processing unit 16 further includes a voice recognition unit, a voiceprint storage unit and an utterance time division unit, and combines the two analysis methods through a determination control unit; the procedure is roughly as follows.
The in-image-section face recognition unit 162 intercepts the face images in the utterance image section divided by the utterance image section dividing unit 161 and recognizes each face as a face in the image section, and determines whether the face images are partially occluded (that is, the corresponding mouth feature points are not acquired).
When the face recognition unit 162 in the image portion judges that the face image is blocked, the judgment control unit controls the recognition corresponding unit 165 to sequentially recognize the voiceprints in the speech sound portions divided by the speech time portion dividing unit according to the voiceprint storage unit, and judges the corresponding participant from the voiceprint storage unit according to the speech voiceprint.
In the second modification, the utterance image division unit 161 and the utterance time division unit each divide the utterance image portion or the utterance sound portion according to an utterance pause in the utterance text, so that the times of the utterance image portion and the utterance sound portion correspond to each other, and when the image-portion-inside-face recognition unit 162 determines that the face of the person in the utterance image portion is blocked, the determination control unit controls the relevant component to analyze the utterance sound portion corresponding to the utterance image portion, thereby achieving the correspondence between the utterance text and the utterance participant.
In the second modification, the word correspondence processing unit can also associate the speech text with the speaker by judging the participants' voiceprint features: the voice recognition unit recognizes the voiceprint features of the participants, and the recognition correspondence unit associates the participants with the speech text and the speaking time according to the voiceprints. When a face in the panoramic image is occluded and cannot be judged, voiceprint recognition still allows the speaker in that situation to be determined. Moreover, since the voiceprint features collected by the word correspondence processing unit and the voiceprint features used for recognition are both collected at the same conference site, the influence of background noise on voiceprint recognition is greatly reduced, enabling highly accurate voiceprint recognition.
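The fallback logic of this modification, choosing mouth-movement analysis when landmarks are available and voiceprints when a face is occluded, might look roughly like the sketch below. The helper callables are hypothetical stand-ins for the mouth-shape judgment and the voiceprint matching described in this document.

```python
def resolve_speaker(face_frames_by_participant, segment_audio,
                    count_mouth_openings, match_by_voiceprint):
    """Determine the speaker of one utterance segment.

    `face_frames_by_participant` maps participant -> sampled face frames;
    `count_mouth_openings` and `match_by_voiceprint` are hypothetical stand-ins
    for the mouth-shape judgment and voiceprint matching described above.
    """
    counts = {p: count_mouth_openings(frames)
              for p, frames in face_frames_by_participant.items()}
    # If at least one face shows usable mouth movement, pick the most active one.
    if counts and max(counts.values()) > 0:
        return max(counts, key=counts.get)
    # Otherwise no usable landmarks (face occluded): fall back to voiceprints.
    return match_by_voiceprint(segment_audio)
```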
<Modification Example 3>
In the second modification, the character association processing unit 16 is a processing unit that associates the utterance characters with the utterance participants based on the panoramic image and the voiceprint join analysis. However, in the present invention, the character associating processing unit 16 based on only the voiceprint recognition may be adopted. Specifically, the word correspondence processing unit 16 includes a voice recognition unit, a voiceprint storage unit, an utterance time division unit, and a recognition correspondence unit 165.
The voice recognition unit analyzes the utterance voice stored in the conference record storage unit 11 to acquire the voiceprint characteristics of each participant, and recognizes the voiceprints of different participants.
And the voiceprint storage unit correspondingly stores the voiceprint identified by the voice identification unit and the corresponding participant.
The speech time division unit divides each speech in the speech sound into different speech sound parts according to the speech pause in the speech character. The speech sound part has time information and sound information of the participant, and the identification correspondence unit 165 can realize correspondence between the speech time and the participant by correspondence between the time information and the speech time and a comparison result between the sound information and a voiceprint of the participant.
The recognition correspondence unit 165 sequentially recognizes the voiceprints in the speech sound portions divided by the speech time portion dividing unit from the voiceprint storage unit, and determines the corresponding participant from the voiceprint storage unit according to the speech voiceprint.
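The voiceprint matching performed by the recognition correspondence unit can be sketched as a nearest-neighbor search over speaker embeddings, as below. The embed_voice() function and the similarity threshold are assumptions made for illustration; the patent does not prescribe a particular voiceprint representation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two voiceprint vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_segment_to_participant(segment_audio, voiceprint_store, embed_voice,
                                 min_similarity=0.75):
    """Match one speech-sound segment to the stored participant voiceprints.

    `embed_voice` is a hypothetical function that turns audio into a fixed-size
    voiceprint vector; `voiceprint_store` maps participant -> reference vector.
    Returns the best-matching participant, or None below the similarity threshold.
    """
    segment_vec = embed_voice(segment_audio)
    best_participant, best_score = None, -1.0
    for participant, reference_vec in voiceprint_store.items():
        score = cosine(segment_vec, reference_vec)
        if score > best_score:
            best_participant, best_score = participant, score
    return best_participant if best_score >= min_similarity else None
```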
The above-described embodiments and modifications are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
Claims (8)
1. A conference record analysis apparatus for analyzing a conference record that is recorded by panoramic imaging and that contains a panoramic image and a speech sound to acquire a speech record, comprising:
a conference record storage part, a face recognition part, a voice conversion part, a word correspondence processing part and a speech record storage part,
the conference record storage section stores the conference record,
the face recognition part analyzes the panoramic image in the conference record to obtain different face characteristics of each participant and recognizes the face of each participant,
the voice converting part converts the speaking voice into corresponding speaking characters according to time,
the word correspondence processing section corresponds the utterance word to the corresponding participant according to the utterance time of the participant,
the speaking record storage part correspondingly stores the speaking characters, the speaking time and the corresponding participants.
2. The apparatus for analyzing a conference record according to claim 1, further comprising:
an identification information assigning part and a face storage part,
wherein the identification information assigning part sets a corresponding number of pieces of identification information according to the number of participants recognized by the face recognition part and assigns the identification information to each participant,
the face storage part correspondingly stores the identification information and the faces of the participants,
the utterance recording storage unit stores the utterance text, the utterance time, and the identification information of the corresponding participant in association with each other.
3. The apparatus for analyzing a conference record according to claim 1, further comprising:
an organization personnel management database and a human face retrieval part,
wherein the organization personnel management database at least stores the identification information of each organization personnel and the corresponding face image,
the face retrieval part retrieves the organization personnel management database according to the faces of the participants identified by the face identification part to obtain the identification information of each participant,
the utterance recording storage unit stores the utterance text, the utterance time, and the identification information of the corresponding participant in association with each other.
4. The conference record parsing apparatus according to any one of claims 1 to 3, wherein:
wherein, the word correspondence processing part comprises:
the voice recognition unit analyzes the speaking voice in the conference record to obtain the voiceprint characteristics of each participant, so that different voiceprints of the participants are recognized;
the voiceprint storage unit is used for correspondingly storing the voiceprint and the corresponding participant;
a speech time division unit which divides each speech in the speech sound into different speech sound parts according to the speech pause in the speech character;
and the identification corresponding unit is used for sequentially identifying the voiceprints in the speech sound parts according to the voiceprint storage unit and judging the corresponding participants from the voiceprint storage unit according to the speech voiceprints.
5. The conference record parsing apparatus according to any one of claims 1 to 3, wherein:
wherein, the word correspondence processing part comprises:
a speech image part dividing unit which divides each speech in the panoramic image into different speech image parts according to speech pauses in the speech texts;
the in-image-part face recognition unit intercepts the face image in the speech image part and recognizes each face as the face in the image part;
a mouth shape conversion judging unit which respectively judges mouth shape conversion of the human face in each image part in the speaking image part so as to obtain the mouth shape conversion times of the participant corresponding to the human face in each image part in the speaking time;
a speaker determination unit that counts the number of times of mouth shape conversion of a face in each of the image portions in the speech image portion and determines the face in the image portion having the largest number of times of mouth shape conversion as a speech face in the speech image portion;
and the identification corresponding unit is used for sequentially judging the participants corresponding to the speaking time according to the speaking face in each speaking image part.
6. A conference record playback system for processing a conference record that is recorded by panoramic imaging and that contains a panoramic image and a speech sound, and playing back the processed conference record, comprising:
the conference record analysis device is used for processing the conference record; and
a conference record playing device for playing the processed conference record to be watched by the watching users,
wherein the conference record analysis device comprises a conference record storage part, a face recognition part, a voice conversion part, a word correspondence processing part and a speech record storage part,
the conference record storage section stores the conference record,
the face recognition part analyzes the panoramic image in the conference record to obtain different face characteristics of each participant and recognizes the face of each participant,
the voice converting part converts the speaking voice into corresponding speaking characters according to time,
the word correspondence processing section corresponds the utterance word to the corresponding participant according to the utterance time of the participant,
the speaking record storage part correspondingly stores the speaking characters, the speaking time and the corresponding participants;
the conference recording and playing device comprises a picture storage part, a playing control part, an input display part, an image playing part and a sound playing part,
the picture storage part stores a record playing picture which is provided with an image display part and a character display part,
the playback control unit controls the input display unit to display the record playback screen, controls the image playback unit to play the panoramic image in the image display portion, controls the sound playback unit to play the speech sound, and further displays the speech text and the speaking time in the text display portion according to the speaking time so that the viewing user can click on the speech text,
once the viewing user clicks the utterance text, the playback control unit controls the panoramic image and the utterance sound played by the image playback unit and the sound playback unit according to the utterance time corresponding to the clicked utterance text, so that the current playback time of the panoramic image and the utterance sound corresponds to the utterance time.
7. The system for playing back a conference recording according to claim 6, wherein:
wherein the conference record playing device is also provided with a time progress retrieval part and a face coordinate acquisition part,
the image playing part is a panoramic image playing part which plays the panoramic image according to different visual angles,
the playing control part controls the time progress retrieval part to retrieve the speaking time and the corresponding attendees in the speaking record storage part according to the current playing time according to the preset refreshing time, controls the face coordinate acquisition part to analyze the panoramic image in the conference record according to the attendees to acquire coordinate information corresponding to the faces of the attendees, and further controls the panoramic image playing part to adjust the playing view angle center in the panoramic image to a view angle taking the coordinate information as the center according to the coordinate information corresponding to the speaking face.
8. A conference record analyzing method for analyzing a conference record which is recorded by panoramic imaging and contains a panoramic image and a speech sound to acquire a speech record, comprising the steps of:
a conference record storage step of storing the conference record;
a face recognition step, in which the panoramic image in the conference record is analyzed to obtain different face characteristics of each participant and the face of each participant is recognized;
a voice conversion step of converting the speech sound into a corresponding speech character according to time;
a word correspondence processing step of corresponding the speaking words to the corresponding participants according to the speaking time of the participants;
and a speech record storage step, in which the speech words, the speech time and the corresponding participants are correspondingly stored.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811353598.7A CN111193890B (en) | 2018-11-14 | 2018-11-14 | Conference record analyzing device and method and conference record playing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111193890A (en) | 2020-05-22
CN111193890B (en) | 2022-06-17
Family
ID=70708959
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811353598.7A (granted as CN111193890B, Active) | 2018-11-14 | 2018-11-14 | Conference record analyzing device and method and conference record playing system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111193890B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002099530A (en) * | 2000-09-22 | 2002-04-05 | Sharp Corp | Minutes production device, method and storage medium using it |
CN1783998A (en) * | 2004-10-30 | 2006-06-07 | 微软公司 | Automatic face extraction for use in recorded meetings timelines |
US20180027123A1 (en) * | 2015-02-03 | 2018-01-25 | Dolby Laboratories Licensing Corporation | Conference searching and playback of search results |
CN105915798A (en) * | 2016-06-02 | 2016-08-31 | 北京小米移动软件有限公司 | Camera control method in video conference and control device thereof |
CN106657865A (en) * | 2016-12-16 | 2017-05-10 | 联想(北京)有限公司 | Method and device for generating conference summary and video conference system |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112037791A (en) * | 2020-08-12 | 2020-12-04 | 广东电力信息科技有限公司 | Conference summary transcription method, apparatus and storage medium |
CN111967372A (en) * | 2020-08-14 | 2020-11-20 | 国网四川省电力公司信息通信公司 | Image identification method for conference system |
CN111967372B (en) * | 2020-08-14 | 2024-03-05 | 国网四川省电力公司信息通信公司 | Image recognition method for conference system |
CN112532912A (en) * | 2020-11-20 | 2021-03-19 | 北京搜狗科技发展有限公司 | Video processing method and device and electronic equipment |
CN112672095A (en) * | 2020-12-25 | 2021-04-16 | 联通在线信息科技有限公司 | Teleconferencing system |
CN112672095B (en) * | 2020-12-25 | 2022-10-25 | 联通在线信息科技有限公司 | Teleconferencing system |
CN112839195A (en) * | 2020-12-30 | 2021-05-25 | 深圳市皓丽智能科技有限公司 | Method and device for consulting meeting record, computer equipment and storage medium |
CN112839195B (en) * | 2020-12-30 | 2023-10-10 | 深圳市皓丽智能科技有限公司 | Conference record consulting method and device, computer equipment and storage medium |
CN112887659A (en) * | 2021-01-29 | 2021-06-01 | 深圳前海微众银行股份有限公司 | Conference recording method, device, equipment and storage medium |
CN113014732A (en) * | 2021-02-04 | 2021-06-22 | 腾讯科技(深圳)有限公司 | Conference record processing method and device, computer equipment and storage medium |
CN113822205A (en) * | 2021-09-26 | 2021-12-21 | 北京市商汤科技开发有限公司 | Conference record generation method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111193890B (en) | 2022-06-17 |
Similar Documents
Publication | Title |
---|---|
CN111193890B (en) | Conference record analyzing device and method and conference record playing system | |
CN108305632B (en) | Method and system for forming voice abstract of conference | |
US10248934B1 (en) | Systems and methods for logging and reviewing a meeting | |
CN106782545B (en) | System and method for converting audio and video data into character records | |
Wellner et al. | Browsing recorded meetings with Ferret | |
CN106657865B (en) | Conference summary generation method and device and video conference system | |
US6272461B1 (en) | Method and apparatus for an enhanced presentation aid | |
JP4466564B2 (en) | Document creation / viewing device, document creation / viewing robot, and document creation / viewing program | |
CN112672095B (en) | Teleconferencing system | |
JP2006085440A (en) | Information processing system, information processing method and computer program | |
JP2001256335A (en) | Conference recording system | |
GB2342802A (en) | Indexing conference content onto a timeline | |
JP2005267279A (en) | Information processing system and information processing method, and computer program | |
JP2005341015A (en) | Video conference system with minute creation support function | |
EP2927853A1 (en) | Method of capturing and structuring information from a meeting | |
JP2014220619A (en) | Conference information recording system, information processing unit, control method and computer program | |
Caridakis et al. | A multimodal corpus for gesture expressivity analysis | |
Cappellini et al. | A multimodal corpus to study videoconference interactions for techno-pedagogical competence in second language acquisition and teacher education | |
JP4572545B2 (en) | Information processing system, information processing method, and computer program | |
US20050131697A1 (en) | Speech improving apparatus, system and method | |
JP2021076715A (en) | Voice acquisition device, voice recognition system, information processing method, and information processing program | |
Wellner et al. | Browsing recordings of multi-party interactions in ambient intelligence environments | |
KR102291113B1 (en) | Apparatus and method for producing conference record | |
US20200075025A1 (en) | Information processing apparatus and facilitation support method | |
JP2011519079A (en) | Photorealistic talking head creation, content creation, and distribution system and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||