CN116991808B - Intelligent data storage method and device for enterprise conference

Info

Publication number
CN116991808B
Authority
CN
China
Prior art keywords
audio
target
conference
audio data
voiceprint
Prior art date
Legal status
Active
Application number
CN202311258355.6A
Other languages
Chinese (zh)
Other versions
CN116991808A (en)
Inventor
孙立彬 (Sun Libin)
Current Assignee
Nantong Huashidai Information Technology Co., Ltd.
Original Assignee
Nantong Huashidai Information Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Nantong Huashidai Information Technology Co., Ltd.
Priority to CN202311258355.6A
Publication of CN116991808A
Application granted
Publication of CN116991808B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/172 Caching, prefetching or hoarding of files
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G06F16/182 Distributed file systems
    • G06F21/31 User authentication

Abstract

The invention provides an intelligent data storage method and device for enterprise conferences, relating to the technical field of data processing. The method solves the prior-art technical problem that enterprise conference data is stored directly without processing, so that the conference data actually recorded and stored has poor reusability. It achieves the technical effects of improving the storage security and usability of enterprise conference data and improving the listening experience when the data is later retrieved and played back for work review.

Description

Intelligent data storage method and device for enterprise conference
Technical Field
The invention relates to the technical field of data processing, and in particular to an intelligent data storage method and device for enterprise conferences.
Background
Many enterprises do not purchase dedicated camera-and-microphone combination equipment for their meeting rooms. They therefore store the audio and video captured during a conference directly, without any subsequent processing or conversion into text. The stored conference data may then contain speech at inconsistent volumes, which degrades the reusability of the conference audio and video.
When an enterprise needs to review conference content, extract key information, or integrate the record with other documents, the raw conference video and audio cannot be searched quickly for specific topics or individual speakers, nor does it support full-text retrieval, which limits conference records and knowledge management.
In summary, in the prior art, enterprise conference data is stored directly without processing, which causes the technical problem that the conference data actually recorded and stored has poor reusability.
Disclosure of Invention
The application provides an intelligent data storage method and device for enterprise conferences, to solve the prior-art technical problem that storing enterprise conference data directly without processing results in poor reusability of the data actually recorded and stored.
In view of the above problem, the application provides an intelligent data storage method and device for enterprise conferences.
In a first aspect of the present application, there is provided an intelligent data storage method for an enterprise conference, the method comprising: interactively obtaining target conference audio-video data, wherein the target conference audio-video data is obtained by synchronously recording the video and audio of a target enterprise conference, the target enterprise conference has K participants, and K is a positive integer; generating K participant voiceprint features, wherein the K participant voiceprint features are obtained by performing synchronized video-audio analysis on the target conference audio-video data, and the K participant voiceprint features are mapped one-to-one to the K participants; performing audio track extraction on the target conference audio-video data to obtain target audio data; obtaining K groups of participant audio, wherein the K groups of participant audio are obtained by performing sound source separation on the target audio data using the K participant voiceprint features; presetting a standard volume threshold, and performing volume unification processing on the K groups of participant audio with the standard volume threshold to obtain K groups of volume-unified audio; obtaining target stored audio data, wherein the target stored audio data is obtained by time-sequence restoration of the K groups of volume-unified audio; obtaining target stored text data, wherein the target stored text data is obtained by performing text processing on the target stored audio data; and presetting conference data access permissions, and storing the target stored audio data and the target stored text data into a target enterprise cloud space according to the conference data access permissions.
In a second aspect of the present application, there is provided an intelligent data storage apparatus for an enterprise conference, the apparatus comprising: a conference data interaction module, used to interactively obtain target conference audio-video data, wherein the target conference audio-video data is obtained by synchronously recording the video and audio of a target enterprise conference, the target enterprise conference has K participants, and K is a positive integer; a voiceprint feature generation module, used to generate K participant voiceprint features, wherein the K participant voiceprint features are obtained by performing synchronized video-audio analysis on the target conference audio-video data and are mapped one-to-one to the K participants; an audio track extraction execution module, used to perform audio track extraction on the target conference audio-video data to obtain target audio data; a participant audio splitting module, used to obtain K groups of participant audio, wherein the K groups of participant audio are obtained by performing sound source separation on the target audio data using the K participant voiceprint features; a volume processing execution module, used to preset a standard volume threshold and perform volume unification processing on the K groups of participant audio with the standard volume threshold to obtain K groups of volume-unified audio; a stored-audio generation module, used to obtain target stored audio data, wherein the target stored audio data is obtained by time-sequence restoration of the K groups of volume-unified audio; a stored-text obtaining module, used to obtain target stored text data, wherein the target stored text data is obtained by performing text processing on the target stored audio data; and an access permission setting module, used to preset conference data access permissions and store the target stored audio data and the target stored text data into a target enterprise cloud space according to the conference data access permissions.
One or more technical solutions provided by the application have at least the following technical effects or advantages:
The method provided by the embodiments of the application interactively obtains target conference audio-video data, wherein the target conference audio-video data is obtained by synchronously recording the video and audio of a target enterprise conference with K participants, K being a positive integer; generates K participant voiceprint features by performing synchronized video-audio analysis on the target conference audio-video data, the K participant voiceprint features being mapped one-to-one to the K participants; performs audio track extraction on the target conference audio-video data to obtain target audio data; obtains K groups of participant audio by performing sound source separation on the target audio data using the K participant voiceprint features; presets a standard volume threshold and performs volume unification processing on the K groups of participant audio with that threshold to obtain K groups of volume-unified audio; obtains target stored audio data by time-sequence restoration of the K groups of volume-unified audio; obtains target stored text data by performing text processing on the target stored audio data; and presets conference data access permissions, storing the target stored audio data and the target stored text data into a target enterprise cloud space according to those permissions. It thereby achieves the technical effects of improving the storage security and usability of enterprise conference data and improving the listening experience when the data is later retrieved and played back for work review.
Drawings
FIG. 1 is a schematic flowchart of the intelligent data storage method for enterprise conferences provided by the application;
FIG. 2 is a schematic flowchart of constructing the voiceprint feature recognition sub-network in the intelligent data storage method for enterprise conferences provided by the application;
FIG. 3 is a schematic flowchart of the volume unification processing in the intelligent data storage method for enterprise conferences provided by the application;
FIG. 4 is a schematic structural diagram of the intelligent data storage apparatus for enterprise conferences provided by the application.
Reference numerals: conference data interaction module 1, voiceprint feature generation module 2, audio track extraction execution module 3, participant audio splitting module 4, volume processing execution module 5, stored-audio generation module 6, stored-text obtaining module 7, access permission setting module 8.
Detailed Description
The application provides an intelligent data storage method and device for enterprise conferences, to solve the prior-art technical problem that storing enterprise conference data directly without processing results in poor reusability of the data actually recorded and stored. The method achieves the technical effects of improving the storage security and usability of enterprise conference data and improving the listening experience when the data is later retrieved and played back for work review.
The technical solution of the application complies with the relevant regulations on data acquisition, storage, use, and processing.
The technical solutions of the present application are described clearly and completely below with reference to the accompanying drawings. It should be understood that the described embodiments are only some, not all, of the embodiments of the present application, and the present application is not limited by the exemplary embodiments described herein. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the protection scope of the application. It should further be noted that, for convenience of description, only some rather than all of the content related to the present application is shown in the drawings.
Example 1
As shown in FIG. 1, the present application provides an intelligent data storage method for enterprise conferences, the method comprising:
A100, interactively obtaining target conference audio-video data, wherein the target conference audio-video data is obtained by synchronously recording the video and audio of a target enterprise conference, the target enterprise conference has K participants, and K is a positive integer;
Specifically, in this embodiment, while the target enterprise conference is in session, a meeting-room camera-and-microphone combination device synchronously captures and records its video and audio to obtain the target conference audio-video data. In principle every participant in the target enterprise conference will speak, and the conference has K participants in total, where K is a positive integer.
Because there is only one set of camera-and-microphone combination equipment, the recorded speech volume of each participant in the target conference audio-video data depends on that participant's distance from the device, so participants at different positions are recorded at different volumes.
A200, generating K participant voiceprint features, wherein the K participant voiceprint features are obtained by performing synchronized video-audio analysis on the target conference audio-video data, and the K participant voiceprint features are mapped one-to-one to the K participants;
In one embodiment, K participant voiceprint features are generated, wherein the K participant voiceprint features are obtained by performing synchronized video-audio analysis on the target conference audio-video data and are mapped one-to-one to the K participants, and method step A200 provided by the present application further includes:
A210, presetting conference audio collection nodes, and performing audio collection on the target audio data based on the conference audio collection nodes to obtain M segments of local audio data, wherein the M segments of local audio data have M local audio collection nodes and M is a positive integer;
A220, presetting a voiceprint feature extraction rule, and traversing the M segments of local audio data based on the voiceprint feature extraction rule to obtain M groups of voiceprint feature parameters;
A230, aggregating the M groups of voiceprint feature parameters to obtain N participant voiceprint features and N feature collection nodes, wherein N is a positive integer less than or equal to K;
A240, judging whether the feature quantity of the N participant voiceprint features satisfies the K participants;
A250, if the feature quantity of the N participant voiceprint features satisfies the K participants, extracting N pieces of conference video information from the target conference audio-video data based on the N feature collection nodes;
A260, performing behavior feature analysis on the N pieces of conference video information, and locating N speaking participants;
A270, performing identity recognition on the N speaking participants, and constructing an association mapping between participant identities and the N participant voiceprint features to obtain the K participant voiceprint features.
In one embodiment, it is judged whether the feature quantity of the N participant voiceprint features satisfies the K participants, and method step A240 provided in the present application further includes:
A241, judging whether the feature quantity of the N participant voiceprint features satisfies the K participants;
A242, if the feature quantity of the N participant voiceprint features does not satisfy the K participants, performing a second audio collection on the target audio data based on the conference audio collection nodes to obtain M segments of second local audio data, wherein the M segments of second local audio data have M second local audio collection nodes;
A243, traversing the M segments of second local audio data based on the voiceprint feature extraction rule to obtain M groups of second voiceprint feature parameters;
A244, aggregating the M groups of voiceprint feature parameters and the M groups of second voiceprint feature parameters to obtain H participant voiceprint features and H feature collection nodes, wherein H is a positive integer less than or equal to K;
A245, judging whether the feature quantity of the H participant voiceprint features satisfies the K participants;
A246, and so on, performing audio collection and voiceprint feature analysis on the target audio data multiple times based on the conference audio collection nodes until the feature quantity of the participant voiceprint features satisfies the number of participants of the target enterprise conference.
Specifically, in this embodiment, audio track extraction is performed on the target conference audio-video data to obtain target audio data, where the target audio data contains the speech of the K participants. The conference audio collection nodes specify the duration of each audio collection from the target audio data and the interval between collections; for example, a node may collect 3 minutes of audio and then collect another 3 minutes after a 3-minute interval.
A conference audio collection start time is selected at random, for example the 15th second of the target audio data. Starting from that time, audio collection is performed on the target audio data based on the conference audio collection nodes to obtain M segments of local audio data, where the M segments of local audio data have M local audio collection nodes and M is a positive integer. A local audio collection node records the collection start time and collection end time of one segment of local audio data within the target audio data. Given the definition of the conference audio collection nodes, the M segments of local audio data are M segments of audio separated by the configured collection interval.
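For illustration only, the following Python sketch shows one way such collection nodes could be realized. The soundfile library, the file name, and the 15-second/3-minute values are assumptions taken from the example above, not part of the disclosed method.

```python
import soundfile as sf

def collect_local_audio(path, start_s=15.0, seg_s=180.0, gap_s=180.0):
    """Cut M segments of local audio from the target audio data:
    seg_s seconds of audio every (seg_s + gap_s) seconds, beginning
    at the randomly chosen collection start time start_s."""
    audio, sr = sf.read(path)
    segments = []  # each entry: (start_time_s, end_time_s, samples)
    t, total_s = start_s, len(audio) / sr
    while t + seg_s <= total_s:
        a, b = int(t * sr), int((t + seg_s) * sr)
        segments.append((t, t + seg_s, audio[a:b]))  # node = (start, end)
        t += seg_s + gap_s  # skip the collection interval
    return sr, segments

sr, local_segments = collect_local_audio("target_conference_audio.wav")
print(f"M = {len(local_segments)} segments of local audio data")
```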
Specifically, it should be understood that different people have different voiceprint features. On this basis, this embodiment presets a voiceprint feature extraction rule so that whether the M segments of local audio data contain speech from the same participant can be judged from voiceprint features.
The M segments of local audio data are traversed based on the voiceprint feature extraction rule to obtain M groups of voiceprint feature parameters, where each group includes a pitch feature parameter, a formant feature parameter, and a short-time energy feature parameter.
Groups among the M groups of voiceprint feature parameters whose extracted parameter values match are merged to obtain N participant voiceprint features, where each participant voiceprint feature covers one or more groups of voiceprint feature parameters. For each participant voiceprint feature, one local audio collection node corresponding to its voiceprint feature parameters is selected at random and taken as the feature collection node of that participant voiceprint feature, giving N feature collection nodes in total, where N is a positive integer less than or equal to K. A feature collection node has the same meaning as a local audio collection node: the collection start time and collection end time of the corresponding local audio data within the target audio data.
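A minimal sketch of this extraction-and-aggregation step, continuing the previous snippet (sr, local_segments). It uses librosa pitch and short-time energy as stand-ins for the full parameter set (the formant parameter is omitted), and exact-match grouping of rounded values stands in for the aggregation rule; all of these choices are assumptions.

```python
import numpy as np
import librosa

def voiceprint_params(samples, sr):
    """Coarse (pitch, energy) parameters for one segment; a fuller
    implementation would also extract formant parameters (e.g. via LPC)."""
    y = samples.astype(float)
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    pitch = float(np.nanmedian(f0))                    # median F0, in Hz
    energy = float(np.mean(librosa.feature.rms(y=y)))  # short-time energy
    return (round(pitch, 0), round(energy, 3))

# Aggregation: segments whose parameter tuples match are merged into one
# candidate participant voiceprint feature.
clusters = {}  # parameter tuple -> list of local audio collection nodes
for start, end, samples in local_segments:
    clusters.setdefault(voiceprint_params(samples, sr), []).append((start, end))
print(f"N = {len(clusters)} candidate participant voiceprint features")
```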
It is judged whether the feature quantity of the N participant voiceprint features satisfies the K participants. If it does, then K = N, meaning the speech audio and speech voiceprint features of each of the K participants have been found.
After the speech voiceprint features of each of the K participants are obtained, the correspondence between each speech voiceprint feature and a specific participant still needs to be determined, so that the speech content of the K participants can be accurately distinguished within the single-track target audio data.
Specifically, in this embodiment, N pieces of conference video information are extracted from the target conference audio-video data based on the N feature collection nodes, where the video duration of each piece of conference video information is consistent with the audio duration of its feature collection node.
A behavior feature comparison model is pre-constructed; it overlays two images to find the regions in which they differ and crops those regions out.
The first piece of conference video information is taken from the N pieces of conference video information, and a frame extraction interval is preset, for example one frame every 3 seconds, giving F conference video frame images. Adjacent images are paired to give F-1 groups of conference video frame images, and the F-1 groups are input into the behavior feature comparison model to obtain F-1 region images, which are in essence cropped head shots of participants. Behavior feature analysis is performed on the N pieces of conference video information with the same method, locating N speaking participants, that is, photographs of the speaking participants.
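The behavior feature comparison could be sketched with plain frame differencing in OpenCV; the threshold value and minimum area below are illustrative assumptions.

```python
import cv2

def changed_region(frame_a, frame_b, min_area=500):
    """Overlay two adjacent video frames and crop the largest region in
    which they differ, approximating the comparison model above."""
    diff = cv2.absdiff(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY),
                       cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY))
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) >= min_area]
    if not boxes:
        return None
    x, y, w, h = max(boxes, key=lambda r: r[2] * r[3])
    return frame_b[y:y + h, x:x + w]  # cropped image of the moving speaker
```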
Based on the target enterprise conference, K photographs of the K participants are obtained, each carrying a name label. Using existing face recognition technology, the K photographs are traversed against the N speaking participants to identify them and determine their names, and an association mapping between participant identities and the N speech voiceprint features is constructed, so that each of the K speech voiceprint features points explicitly to a specific participant.
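One possible realization of the "existing face recognition technology" is the open-source face_recognition package; the roster layout below is a hypothetical stand-in for the labelled photographs.

```python
import face_recognition

def identify_speaker(crop_path, roster):
    """roster: hypothetical dict of participant name -> labelled photo path.
    Returns the name whose photo matches the cropped speaker image."""
    probe = face_recognition.face_encodings(
        face_recognition.load_image_file(crop_path))
    if not probe:
        return None
    for name, photo_path in roster.items():
        known = face_recognition.face_encodings(
            face_recognition.load_image_file(photo_path))
        if known and face_recognition.compare_faces([known[0]], probe[0])[0]:
            return name  # speaking participant identified by name
    return None
```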
In this embodiment, the various voiceprint features are determined by analyzing the audio data, and the association between voiceprint features and participants is further established with the help of the conference video, achieving the technical effect of obtaining accurate participant voiceprint features and providing an audio classification reference for extracting each person's conference speech content from the target conference audio-video data based on voiceprint features.
Further, when judging whether the feature quantity of the N participant voiceprint features satisfies the K participants: if it does not, the voiceprint features of each of the K participants have not yet been obtained.
On this basis, in this embodiment, the conference audio collection start time is reset, for example to the 3rd minute of the target audio data. Starting from that time, a second audio collection is performed on the target audio data based on the conference audio collection nodes to obtain M segments of second local audio data, where the M segments of second local audio data have M second local audio collection nodes.
Using the same method that produced the N participant voiceprint features and N feature collection nodes, the M segments of second local audio data are traversed based on the voiceprint feature extraction rule to obtain M groups of second voiceprint feature parameters; the M groups of voiceprint feature parameters and the M groups of second voiceprint feature parameters are then aggregated to obtain H participant voiceprint features and H feature collection nodes, where H is a positive integer less than or equal to K.
Whether the feature quantity of the H participant voiceprint features satisfies the K participants is judged, and so on: audio collection and voiceprint feature analysis are performed on the target audio data multiple times based on the conference audio collection nodes until the feature quantity of the participant voiceprint features satisfies the number of participants of the target enterprise conference.
By collecting audio and analyzing voiceprint features multiple times over the target audio data, this embodiment achieves the technical effect of ensuring that the voiceprint features of each of the K participants are obtained.
A300, performing audio track extraction on the target conference audio-video data to obtain target audio data;
Specifically, in this embodiment, software that supports extracting audio tracks from video files, such as Adobe Audition or Audacity, is selected, and the target conference audio-video data is imported for audio track extraction to obtain the target audio data. The target audio data is a single audio track containing the speech of each of the K participants.
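The same extraction can be scripted; the sketch below shells out to the ffmpeg binary (an assumption, in place of the GUI tools named above) to produce a mono track for later processing.

```python
import subprocess

def extract_audio_track(video_path, out_wav="target_audio.wav"):
    """Pull the audio track out of the conference recording.
    Requires the ffmpeg binary on PATH."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",            # drop the video stream
         "-ac", "1",       # keep a single (mono) track, as in the method
         "-ar", "16000",   # a common sample rate for speech processing
         out_wav],
        check=True)
    return out_wav
```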
A400, obtaining K groups of participant audio, wherein the K groups of participant audio are obtained by performing sound source separation on the target audio data using the K participant voiceprint features;
in one embodiment, as shown in fig. 2, the method steps provided by the present application further include:
a410, pre-constructing a voiceprint feature recognition sub-network, wherein the voiceprint feature recognition sub-network comprises a voiceprint feature recognition module, an audio splitting execution module and a participant audio storage module;
a420, wherein the reference audio storage module comprises K reference audio storage spaces;
a430, obtaining K groups of voiceprint features-participants according to the mapping relation between the K participant voiceprint features and the K participants;
and A440, taking the voiceprint features as a first attribute, taking the K participating voiceprint features as a first attribute value, taking the meeting participants as a second attribute, taking the K participating persons as a second attribute value, taking the K groups of voiceprint features-meeting participants as construction data, and constructing the voiceprint feature recognition module based on a knowledge graph.
In one embodiment, K groups of participant audio are obtained, where the K groups of participant audio are obtained by performing sound source separation on the target audio data using the K participant voiceprint features, and method step A400 provided in the present application further includes:
A440, after the target audio data is input into the voiceprint feature recognition sub-network, performing participant analysis based on the voiceprint feature recognition module to obtain a first analysis result and a first speaking period;
A450, the audio splitting execution module performing synchronized sound source separation on the target audio data according to the first speaking period to obtain a first participant audio segment;
A460, the participant audio storage module correspondingly calling a first participant audio storage space among the K participant audio storage spaces according to the first analysis result, and storing the first participant audio segment into the first participant audio storage space;
A470, the voiceprint feature recognition module analyzing the target audio data with the first speaking period as a starting point to obtain a second analysis result and a second speaking period;
A480, the audio splitting execution module performing synchronized sound source separation on the target audio data according to the second speaking period to obtain a second participant audio segment;
A490, the participant audio storage module correspondingly calling a second participant audio storage space among the K participant audio storage spaces according to the second analysis result, and storing the second participant audio segment into the second participant audio storage space;
A4100, completing the sound source separation of the target audio data in this manner, and outputting the K groups of participant audio by extracting the audio from the K participant audio storage spaces.
Specifically, in this embodiment, a voiceprint feature recognition sub-network is pre-constructed and configured to automatically perform sound source separation on the target audio data according to the K participant voiceprint features, obtaining K groups of participant audio, where each group of participant audio is the conference speech of one participant. Given how conference discussion proceeds, each participant's speech is usually discontinuous, so each group of participant audio generally contains one or more audio segments, and each segment carries start and end time node identifiers.
The voiceprint feature recognition sub-network comprises a voiceprint feature recognition module, an audio splitting execution module, and a participant audio storage module. The voiceprint feature recognition module performs real-time recognition of voiceprint features in the target audio data and compares them against the K participant voiceprint features, obtaining the appearance time and end time of each participant voiceprint feature and the participant corresponding to that voiceprint feature.
The voiceprint feature recognition module is constructed as follows: K voiceprint feature-participant pairs are obtained according to the mapping relationship between the K participant voiceprint features and the K participants; then, taking the voiceprint feature as a first attribute and the K participant voiceprint features as first attribute values, taking the participant as a second attribute and the K participants as second attribute values, and taking the K voiceprint feature-participant pairs as construction data, the voiceprint feature recognition module is constructed based on a knowledge graph.
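Stripped of the knowledge-graph machinery, the first-attribute/second-attribute construction reduces to a lookup from voiceprint feature values to participant names; the feature keys and names below are hypothetical.

```python
# First attribute: voiceprint feature; second attribute: participant.
voiceprint_to_participant = {}

def build_recognition_map(pairs):
    """pairs: iterable of (voiceprint_feature_key, participant_name),
    i.e. the K voiceprint feature-participant construction data."""
    for feature_key, participant in pairs:
        voiceprint_to_participant[feature_key] = participant

build_recognition_map([((120.0, 0.031), "Participant A"),
                       ((185.0, 0.027), "Participant B")])
print(voiceprint_to_participant[(120.0, 0.031)])  # -> Participant A
```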
The audio splitting execution module cuts the target audio data according to the voiceprint feature start time and end time output by the voiceprint feature recognition module.
The participant audio storage module comprises K participant audio storage spaces mapped to the K participants, and the audio of each corresponding participant is stored in these K participant audio storage spaces.
The specific method of performing sound source separation on the target audio data based on the voiceprint feature recognition sub-network to obtain the K groups of participant audio is as follows:
After the target audio data is input into the voiceprint feature recognition sub-network, the voiceprint feature recognition module extracts real-time voiceprint features from the target audio data based on the voiceprint feature extraction rule and feeds them into the knowledge graph for traversal and comparison to obtain the corresponding participant. This continues until the participant obtained by real-time analysis changes, at which point a first analysis result is output. The first analysis result is the participant who spoke continuously during a period of time, and the first speaking period is the time interval from the time node at which the voiceprint feature recognition module first recognized the participant of the first analysis result to the time node at which that participant changed.
The audio splitting execution module performs synchronized sound source separation on the target audio data according to the first speaking period to obtain a first participant audio segment, which is the conference speech of the speaker corresponding to the first analysis result.
The participant audio storage module correspondingly calls a first participant audio storage space among the K participant audio storage spaces according to the first analysis result, and stores the first participant audio segment into the first participant audio storage space.
The voiceprint feature recognition module then analyzes the target audio data taking the time node within the first speaking period at which the participant changed as the new starting point, obtaining a second analysis result and a second speaking period.
The audio splitting execution module performs synchronized sound source separation on the target audio data according to the second speaking period to obtain a second participant audio segment.
The participant audio storage module correspondingly calls a second participant audio storage space among the K participant audio storage spaces according to the second analysis result and stores the second participant audio segment into it. The process continues in the same way until the sound source separation of the target audio data is complete, and the K groups of participant audio are output by extracting the audio from the K participant audio storage spaces.
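The separation loop can be sketched as follows; match stands in for the voiceprint feature recognition module, and the one-second analysis window is an assumption.

```python
def separate_by_speaker(samples, sr, match, win_s=1.0):
    """Walk the target audio window by window; whenever the matched
    participant changes, close the current segment and open a new one.
    match(window, sr) stands in for the voiceprint recognition module
    and returns a participant name."""
    storage = {}  # participant -> list of (start_s, end_s, samples)
    win = int(win_s * sr)
    seg_start, current = 0, match(samples[:win], sr)
    for i in range(win, len(samples) - win, win):
        speaker = match(samples[i:i + win], sr)
        if speaker != current:  # the speaking period has ended
            storage.setdefault(current, []).append(
                (seg_start / sr, i / sr, samples[seg_start:i]))
            seg_start, current = i, speaker
    storage.setdefault(current, []).append(
        (seg_start / sr, len(samples) / sr, samples[seg_start:]))
    return storage  # K groups of participant audio, with time nodes
```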
By constructing the voiceprint feature recognition sub-network, this embodiment segments and groups the target audio data by participant automatically and with high accuracy, achieving the technical effect of obtaining each participant's speech audio with high reliability.
A500, presetting a standard volume threshold, and performing volume unification processing on the K groups of participant audio with the standard volume threshold to obtain K groups of volume-unified audio;
In one embodiment, as shown in FIG. 3, a standard volume threshold is preset and volume unification processing is performed on the K groups of participant audio with the standard volume threshold to obtain K groups of volume-unified audio, and method step A500 provided by the application further includes:
A510, performing volume calculation on the K groups of participant audio to obtain K groups of participant volume values;
A520, performing mean calculation on the K groups of participant volume values to obtain K overall volume indices;
A530, interactively obtaining K pieces of participant seat information for the K participants in the target enterprise conference;
A540, interactively obtaining the target device position parameter of the device that recorded the target conference audio-video data;
A550, performing volume recording analysis according to the K pieces of participant seat information and the target device position parameter to obtain K volume index weights;
A560, generating the standard volume threshold, wherein the standard volume threshold is calculated from the K overall volume indices and the K volume index weights;
A570, traversing the K groups of participant volume values based on the standard volume threshold and adjusting the audio segments one by one to obtain the K groups of volume-unified audio.
Specifically, in this embodiment, each audio segment of each of the K groups of participant audio is loaded into a computer and converted into a digital representation such as PCM. The volume values of the audio signal are then calculated to obtain a volume value for each audio segment, making up the K groups of participant volume values. The mean of the volume values within each of the K groups is calculated to obtain K overall volume indices.
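A sketch of the volume calculation, assuming participant_audio in the shape produced by the separation sketch above (participant -> list of (start, end, samples)). RMS level in dBFS is used as the volume value, which is one common choice rather than the patent's prescribed measure.

```python
import numpy as np

def volume_db(samples):
    """Root-mean-square level of one audio segment, in dBFS."""
    rms = np.sqrt(np.mean(np.square(samples.astype(float))))
    return 20.0 * np.log10(max(rms, 1e-12))

# participant_audio: participant -> list of (start, end, samples)
volume_values = {p: [volume_db(s) for _, _, s in segs]
                 for p, segs in participant_audio.items()}
overall_volume = {p: float(np.mean(v)) for p, v in volume_values.items()}
```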
K pieces of participant seat information for the K participants in the target enterprise conference are obtained interactively, where the seat information is the position coordinates of each participant in the meeting room.
The target device position parameter of the device that recorded the target conference audio-video data is obtained interactively; the target device is the single set of meeting-room camera-and-microphone combination equipment mentioned in step A100.
With the target device position parameter obtained, it should be understood that because of the device's placement and its differing distances to the K participants, the recorded speech volume of participants at different positions differs within the target conference audio-video data.
On this basis, in this embodiment, K sound-pickup distance parameters are calculated from the K pieces of participant seat information and the target device position parameter; weights are computed from the K sound-pickup distance parameters to obtain K reverse sound-pickup weights, and each of the K reverse sound-pickup weights is subtracted from 1 to obtain the K volume index weights. The K overall volume indices are then weighted by the K volume index weights to obtain the standard volume threshold, which serves as the criterion for deciding whether the K groups of participant volume values need volume adjustment.
The K groups of participant volume values are traversed based on the standard volume threshold, and participant volume values below the standard volume threshold are adjusted segment by segment to obtain the K groups of volume-unified audio.
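The threshold and adjustment steps might look like this; the seat coordinates, the normalization of the reverse weights, and the gain rule are assumptions consistent with the description above.

```python
import numpy as np

def standard_volume_threshold(overall_volume, seats, device_pos):
    """Weight each participant's overall volume index by a distance-derived
    weight: w = 1 - (normalized sound-pickup distance), renormalized."""
    names = list(overall_volume)
    dists = np.array([np.linalg.norm(np.subtract(seats[p], device_pos))
                      for p in names])
    reverse = dists / dists.sum()        # K reverse sound-pickup weights
    weights = 1.0 - reverse              # K volume index weights
    weights /= weights.sum()
    vols = np.array([overall_volume[p] for p in names])
    return float(np.dot(vols, weights))  # the standard volume threshold

def unify_volume(samples, seg_db, target_db):
    """Apply gain so a below-threshold segment reaches the standard level."""
    if seg_db >= target_db:
        return samples
    return samples * 10.0 ** ((target_db - seg_db) / 20.0)
```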
This embodiment achieves the technical effects of eliminating the deviation in participants' speech volume caused by their differing distances from the sound-pickup equipment, and of improving the listening experience when the resulting enterprise conference data is later retrieved and played back.
A600, obtaining target stored audio data, wherein the target stored audio data is obtained by time-sequence restoration of the K groups of volume-unified audio;
A700, obtaining target stored text data, wherein the target stored text data is obtained by performing text processing on the target stored audio data;
Specifically, in this embodiment, since the K groups of volume-unified audio were all cut from the target audio data, and each volume-unified audio segment carries an audio collection node identifier, the audio time sequence can be restored from those identifiers to obtain target stored audio data with uniform volume, where the target stored audio data carries K participant speech labels.
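Time-sequence restoration amounts to sorting all volume-unified segments by their collection-node start times while keeping the speaker labels; a minimal sketch:

```python
def restore_timeline(participant_audio):
    """Merge the K groups of volume-unified audio back into one stream
    ordered by each segment's start-time identifier."""
    timeline = [(start, end, participant, samples)
                for participant, segments in participant_audio.items()
                for start, end, samples in segments]
    timeline.sort(key=lambda seg: seg[0])  # chronological order
    return timeline  # target stored audio with K participant speech labels
```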
Text processing is performed on the target stored audio data to obtain the target stored text data, which resembles a television script in that it accurately identifies the participant corresponding to each passage of spoken text.
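The text processing step could use any speech recognizer; the sketch below assumes the open-source openai-whisper package and a hypothetical wav_paths map from each speaking span to its exported segment file.

```python
import whisper  # openai-whisper; one possible recognizer, an assumption

model = whisper.load_model("base")

def transcribe_timeline(timeline, wav_paths):
    """Produce script-like stored text: one labelled line per speaking span."""
    lines = []
    for start, end, participant, _ in timeline:
        text = model.transcribe(wav_paths[(start, end)])["text"].strip()
        lines.append(f"[{start:8.1f}s] {participant}: {text}")
    return "\n".join(lines)
```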
Through this text processing, the embodiment achieves the technical effects of storing enterprise conference data in both text and audio form and improving the usefulness of the retained conference content files.
A800, presetting conference data access permissions, and storing the target stored audio data and the target stored text data into a target enterprise cloud space according to the conference data access permissions.
In one embodiment, conference data access permissions are preset, and the target stored audio data and the target stored text data are stored into a target enterprise cloud space according to the conference data access permissions, and method step A800 provided by the present application further includes:
A810, obtaining the target organization personnel structure of the target enterprise;
A820, traversing the target organization personnel structure based on the K participants to obtain X organization structure levels;
A830, setting role-controlled access for the K participants according to the X organization structure levels, and generating the conference data access permissions;
A840, storing the target stored audio data and the target stored text data into the target enterprise cloud space according to the conference data access permissions.
Specifically, in this embodiment, the target organization personnel structure of the target enterprise is obtained; the job rank of every employee of the target enterprise can be read directly from it.
The target organization personnel structure is traversed based on the K participants to obtain X organization structure levels, where each level contains one or more participants. The K participants are divided into X groups according to the X organization structure levels, and role-controlled access is set for each of the X groups, forming the conference data access permissions. The target stored audio data and the target stored text data are then stored into the target enterprise cloud space according to the conference data access permissions.
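A toy sketch of the role-controlled access rule, assuming level 1 is the most senior tier; the names and levels are hypothetical.

```python
# Hypothetical organization levels derived from the personnel structure;
# level 1 is the most senior tier and may read everything at or below it.
org_level = {"Participant A": 1, "Participant B": 2, "Participant C": 3}

def review_permission(requester, data_level):
    """Allow access when the requester's organization level is at least
    as senior (numerically lower or equal) as the data's level tag."""
    return org_level.get(requester, float("inf")) <= data_level

assert review_permission("Participant A", 2)      # senior reads junior data
assert not review_permission("Participant C", 1)  # junior blocked from senior tier
```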
By setting conference data access permissions based on the target enterprise's organization personnel structure, this embodiment achieves the technical effect of improving the storage security of enterprise conference data.
Example 2
Based on the same inventive concept as the intelligent data storage method for enterprise conferences in the foregoing embodiments, as shown in FIG. 4, the present application provides an intelligent data storage apparatus for enterprise conferences, wherein the apparatus includes:
a conference data interaction module 1, used to interactively obtain target conference audio-video data, wherein the target conference audio-video data is obtained by synchronously recording the video and audio of a target enterprise conference, the target enterprise conference has K participants, and K is a positive integer;
a voiceprint feature generation module 2, used to generate K participant voiceprint features, wherein the K participant voiceprint features are obtained by performing synchronized video-audio analysis on the target conference audio-video data and are mapped one-to-one to the K participants;
an audio track extraction execution module 3, used to perform audio track extraction on the target conference audio-video data to obtain target audio data;
a participant audio splitting module 4, used to obtain K groups of participant audio, wherein the K groups of participant audio are obtained by performing sound source separation on the target audio data using the K participant voiceprint features;
a volume processing execution module 5, used to preset a standard volume threshold and perform volume unification processing on the K groups of participant audio with the standard volume threshold to obtain K groups of volume-unified audio;
a stored-audio generation module 6, used to obtain target stored audio data, wherein the target stored audio data is obtained by time-sequence restoration of the K groups of volume-unified audio;
a stored-text obtaining module 7, used to obtain target stored text data, wherein the target stored text data is obtained by performing text processing on the target stored audio data;
an access permission setting module 8, used to preset conference data access permissions and store the target stored audio data and the target stored text data into a target enterprise cloud space according to the conference data access permissions.
In one embodiment, the voiceprint feature generation module 2 is further configured to:
preset conference audio collection nodes, and perform audio collection on the target audio data based on the conference audio collection nodes to obtain M segments of local audio data, wherein the M segments of local audio data have M local audio collection nodes and M is a positive integer;
preset a voiceprint feature extraction rule, and traverse the M segments of local audio data based on the voiceprint feature extraction rule to obtain M groups of voiceprint feature parameters;
aggregate the M groups of voiceprint feature parameters to obtain N participant voiceprint features and N feature collection nodes, wherein N is a positive integer less than or equal to K;
judge whether the feature quantity of the N participant voiceprint features satisfies the K participants;
if the feature quantity of the N participant voiceprint features satisfies the K participants, extract N pieces of conference video information from the target conference audio-video data based on the N feature collection nodes;
perform behavior feature analysis on the N pieces of conference video information, and locate N speaking participants;
perform identity recognition on the N speaking participants, and construct an association mapping between participant identities and the N participant voiceprint features to obtain the K participant voiceprint features.
In one embodiment, the voiceprint feature generation module 2 is further configured to:
judge whether the feature quantity of the N participant voiceprint features satisfies the K participants;
if the feature quantity of the N participant voiceprint features does not satisfy the K participants, perform a second audio collection on the target audio data based on the conference audio collection nodes to obtain M segments of second local audio data, wherein the M segments of second local audio data have M second local audio collection nodes;
traverse the M segments of second local audio data based on the voiceprint feature extraction rule to obtain M groups of second voiceprint feature parameters;
aggregate the M groups of voiceprint feature parameters and the M groups of second voiceprint feature parameters to obtain H participant voiceprint features and H feature collection nodes, wherein H is a positive integer less than or equal to K;
judge whether the feature quantity of the H participant voiceprint features satisfies the K participants;
and so on, performing audio collection and voiceprint feature analysis on the target audio data multiple times based on the conference audio collection nodes until the feature quantity of the participant voiceprint features satisfies the number of participants of the target enterprise conference.
In one embodiment, the participant audio splitting module 4 is further configured to:
pre-construct a voiceprint feature recognition sub-network, wherein the voiceprint feature recognition sub-network comprises a voiceprint feature recognition module, an audio splitting execution module, and a participant audio storage module;
the participant audio storage module comprising K participant audio storage spaces;
obtain K voiceprint feature-participant pairs according to the mapping relationship between the K participant voiceprint features and the K participants;
taking the voiceprint feature as a first attribute and the K participant voiceprint features as first attribute values, taking the participant as a second attribute and the K participants as second attribute values, and taking the K voiceprint feature-participant pairs as construction data, construct the voiceprint feature recognition module based on a knowledge graph.
In one embodiment, the participant audio splitting module 4 is further configured so that:
after the target audio data is input into the voiceprint feature recognition sub-network, participant analysis is performed based on the voiceprint feature recognition module to obtain a first analysis result and a first speaking period;
the audio splitting execution module performs synchronized sound source separation on the target audio data according to the first speaking period to obtain a first participant audio segment;
the participant audio storage module correspondingly calls a first participant audio storage space among the K participant audio storage spaces according to the first analysis result, and stores the first participant audio segment into the first participant audio storage space;
the voiceprint feature recognition module analyzes the target audio data with the first speaking period as a starting point to obtain a second analysis result and a second speaking period;
the audio splitting execution module performs synchronized sound source separation on the target audio data according to the second speaking period to obtain a second participant audio segment;
the participant audio storage module correspondingly calls a second participant audio storage space among the K participant audio storage spaces according to the second analysis result, and stores the second participant audio segment into the second participant audio storage space;
and the process continues in the same way until the sound source separation of the target audio data is complete and the K groups of participant audio are output by extracting the audio from the K participant audio storage spaces.
In one embodiment, the volume processing execution module 5 is further configured to:
perform volume calculation on the K groups of participant audio to obtain K groups of participant volume values;
perform mean calculation on the K groups of participant volume values to obtain K overall volume indices;
interactively obtain K pieces of participant seat information for the K participants in the target enterprise conference;
interactively obtain the target device position parameter of the device that recorded the target conference audio-video data;
perform volume recording analysis according to the K pieces of participant seat information and the target device position parameter to obtain K volume index weights;
generate the standard volume threshold, wherein the standard volume threshold is calculated from the K overall volume indices and the K volume index weights;
traverse the K groups of participant volume values based on the standard volume threshold and adjust the audio segments one by one to obtain the K groups of volume-unified audio.
In one embodiment, the access permission setting module 8 is further configured to:
obtain the target organization personnel structure of the target enterprise;
traverse the target organization personnel structure based on the K participants to obtain X organization structure levels;
set role-controlled access for the K participants according to the X organization structure levels, and generate the conference data access permissions;
store the target stored audio data and the target stored text data into the target enterprise cloud space according to the conference data access permissions.
Any of the methods or steps described above may be stored as computer instructions or programs in various non-limiting types of computer memory and executed by various non-limiting types of computer processors, thereby implementing any of the methods or steps described above.
Based on the above embodiments of the present invention, any improvement or modification that does not depart from the principles of the present invention shall fall within the protection scope of the present invention.

Claims (6)

1. An intelligent data storage method for enterprise conferences, the method comprising:
interactively obtaining target conference audio-video data, wherein the target conference audio-video data is obtained by synchronously recording the video and audio of a target enterprise conference, the target enterprise conference has K participants, and K is a positive integer;
generating K participant voiceprint features, wherein the K participant voiceprint features are obtained by performing synchronized video-audio analysis on the target conference audio-video data, and the K participant voiceprint features are mapped one-to-one to the K participants;
performing audio track extraction on the target conference audio-video data to obtain target audio data;
obtaining K groups of participant audio, wherein the K groups of participant audio are obtained by performing sound source separation on the target audio data using the K participant voiceprint features;
presetting a standard volume threshold, and performing volume unification processing on the K groups of participant audio with the standard volume threshold to obtain K groups of volume-unified audio;
obtaining target stored audio data, wherein the target stored audio data is obtained by time-sequence restoration of the K groups of volume-unified audio;
obtaining target stored text data, wherein the target stored text data is obtained by performing text processing on the target stored audio data;
presetting conference data access permissions, and storing the target stored audio data and the target stored text data into a target enterprise cloud space according to the conference data access permissions;
wherein generating the K participant voiceprint features, which are obtained by performing video-audio synchronous analysis on the target conference video and audio data and are mapped one-to-one with the K participants, further comprises:
presetting conference audio acquisition nodes, and performing audio acquisition on the target audio data based on the conference audio acquisition nodes to obtain M segments of local audio data, wherein the M segments of local audio data have M local audio acquisition nodes, M being a positive integer;
presetting a voiceprint feature extraction rule, and traversing the M segments of local audio data based on the voiceprint feature extraction rule to obtain M groups of voiceprint feature parameters;
performing aggregation processing on the M groups of voiceprint feature parameters to obtain N participant voiceprint features and N feature acquisition nodes, N being a positive integer less than or equal to K;
judging whether the feature quantity of the N participant voiceprint features satisfies the K participants;
if the feature quantity of the N participant voiceprint features does not satisfy the K participants, performing secondary audio acquisition on the target audio data based on the conference audio acquisition nodes to obtain M segments of second local audio data, wherein the M segments of second local audio data have M second local audio acquisition nodes;
traversing the M segments of second local audio data based on the voiceprint feature extraction rule to obtain M groups of second voiceprint feature parameters;
performing aggregation processing on the M groups of voiceprint feature parameters and the M groups of second voiceprint feature parameters to obtain H participant voiceprint features and H feature acquisition nodes, H being a positive integer less than or equal to K;
judging whether the feature quantity of the H participant voiceprint features satisfies the K participants;
and so on, performing audio acquisition and voiceprint feature analysis on the target audio data multiple times based on the conference audio acquisition nodes until the feature quantity of the participant voiceprint features satisfies the participant quantity of the target enterprise conference;
if the feature quantity of the N participant voiceprint features satisfies the K participants, extracting N segments of conference video information from the target conference video and audio data based on the N feature acquisition nodes;
performing behavior feature analysis on the N segments of conference video information, and locating N speaking participants;
and performing identity recognition on the N speaking participants, and constructing an association mapping between the participant identities and the N participant voiceprint features to obtain the K participant voiceprint features.
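The iterative acquire-extract-aggregate loop of this claim could be sketched as follows; `extract_features` stands in for the preset voiceprint feature extraction rule and `aggregate` for the aggregation processing, both assumed to be supplied elsewhere, and the acquisition nodes are assumed to be sample-offset ranges.

    def collect_participant_voiceprints(target_audio, nodes, k_participants,
                                        extract_features, aggregate,
                                        max_rounds=10):
        # nodes: preset conference audio acquisition nodes, given here as
        # (start, end) sample offsets into the target audio (assumption)
        gathered = []
        for _ in range(max_rounds):
            # Acquire M segments of local audio data at the acquisition nodes
            segments = [target_audio[start:end] for start, end in nodes]
            # M groups of voiceprint feature parameters, one group per segment
            gathered.extend(extract_features(seg) for seg in segments)
            # Aggregate everything gathered so far into distinct voiceprints
            voiceprints, feature_nodes = aggregate(gathered, nodes)
            # Stop once the feature quantity satisfies the participant count
            if len(voiceprints) >= k_participants:
                return voiceprints, feature_nodes
        raise RuntimeError("voiceprint count never reached participant count")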
2. The method of claim 1, wherein the method further comprises:
pre-constructing a voiceprint feature recognition sub-network, wherein the voiceprint feature recognition sub-network comprises a voiceprint feature recognition module, an audio splitting execution module and a participant audio storage module;
wherein the participant audio storage module comprises K participant audio storage spaces;
obtaining K groups of voiceprint feature-participant pairs according to the one-to-one mapping between the K participant voiceprint features and the K participants;
and taking voiceprint feature as a first attribute and the K participant voiceprint features as first attribute values, taking participant as a second attribute and the K participants as second attribute values, and taking the K groups of voiceprint feature-participant pairs as construction data, constructing the voiceprint feature recognition module based on a knowledge graph.
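A toy rendering of such a knowledge-graph-backed recognition module is given below. The attribute/attribute-value layout follows the claim; the cosine-similarity lookup is an assumption about how recognition against the stored voiceprints might work, not the claimed mechanism.

    import numpy as np

    class VoiceprintRecognitionModule:
        # First attribute: voiceprint feature (K feature vectors as values);
        # second attribute: participant (K names as values)
        def __init__(self, pairs):
            # pairs: K tuples of (voiceprint feature vector, participant name)
            self.features = np.stack([f for f, _ in pairs])
            self.participants = [p for _, p in pairs]

        def recognise(self, query):
            # Nearest stored voiceprint by cosine similarity (assumed metric)
            q = query / np.linalg.norm(query)
            f = self.features / np.linalg.norm(self.features, axis=1,
                                                keepdims=True)
            return self.participants[int(np.argmax(f @ q))]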
3. The method of claim 2, wherein obtaining the K groups of participant audio by performing sound source separation on the target audio data using the K participant voiceprint features further comprises:
inputting the target audio data into the voiceprint feature recognition sub-network, and performing participant analysis based on the voiceprint feature recognition module to obtain a first analysis result and a first speaking period;
performing, by the audio splitting execution module, synchronous sound source separation on the target audio data according to the first speaking period to obtain a first participant audio segment;
calling, by the participant audio storage module, a first participant audio storage space among the K participant audio storage spaces according to the first analysis result, and storing the first participant audio segment into the first participant audio storage space;
analyzing, by the voiceprint feature recognition module, the target audio data with the first speaking period as a starting point to obtain a second analysis result and a second speaking period;
performing, by the audio splitting execution module, synchronous sound source separation on the target audio data according to the second speaking period to obtain a second participant audio segment;
calling, by the participant audio storage module, a second participant audio storage space among the K participant audio storage spaces according to the second analysis result, and storing the second participant audio segment into the second participant audio storage space;
and so on, completing the sound source separation of the target audio data, and outputting the K groups of participant audio by extracting the audio from the K participant audio storage spaces.
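The speaker-by-speaker separation loop could look roughly like this, assuming a hypothetical `recogniser.next_speaker` helper that returns the next speaking participant together with the sample range of that speaking period; the K participant audio storage spaces are modelled as per-participant lists.

    from collections import defaultdict

    def separate_sound_sources(target_audio, recogniser):
        spaces = defaultdict(list)   # participant -> stored audio segments
        offsets = defaultdict(list)  # kept so time sequence can be restored
        cursor = 0
        while cursor < len(target_audio):
            found = recogniser.next_speaker(target_audio, start=cursor)
            if found is None:        # no further speech in the recording
                break
            participant, (seg_start, seg_end) = found
            spaces[participant].append(target_audio[seg_start:seg_end])
            offsets[participant].append(seg_start)
            cursor = seg_end         # next analysis starts where this ended
        return spaces, offsets

Keeping the starting offset alongside each stored segment is what later makes the time-sequence restoration of the target storage audio data a simple merge of all segments ordered by offset.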
4. The method of claim 1, wherein presetting a standard sound volume threshold and performing sound volume unification processing on the K groups of participant audio using the standard sound volume threshold to obtain the K groups of sound volume unified audio further comprises:
performing sound volume calculation on the K groups of participant audio to obtain K groups of participant sound volume values;
averaging each of the K groups of participant sound volume values to obtain K overall sound volume indexes;
interactively obtaining K pieces of participant seat information for the K participants in the target enterprise conference;
interactively obtaining a position parameter of the target device that records the target conference video and audio data;
performing sound volume recording analysis according to the K pieces of participant seat information and the target device position parameter to obtain K sound volume index weights;
generating the standard sound volume threshold, wherein the standard sound volume threshold is calculated from the K overall sound volume indexes and the K sound volume index weights;
and traversing the K groups of participant sound volume values based on the standard sound volume threshold to adjust the audio segments one by one, so as to obtain the K groups of sound volume unified audio.
5. The method of claim 1, wherein presetting a conference data review authority and storing the target storage audio data and the target storage text data to the target enterprise cloud space according to the conference data review authority further comprises:
obtaining a target organization personnel architecture of the target enterprise;
traversing the target organization personnel architecture based on the K participants to obtain X organization architecture levels;
setting role-controlled access for the K participants according to the X organization architecture levels, and generating the conference data review authority;
and storing the target storage audio data and the target storage text data to the target enterprise cloud space according to the conference data review authority.
6. A device for intelligently storing data for enterprise conferences, the device comprising:
a conference data interaction module, configured to interactively obtain target conference video and audio data, wherein the target conference video and audio data are obtained by synchronously recording video and audio of a target enterprise conference, the target enterprise conference has K participants, and K is a positive integer;
a voiceprint feature generation module, configured to generate K participant voiceprint features, wherein the K participant voiceprint features are obtained by performing video-audio synchronous analysis on the target conference video and audio data, and the K participant voiceprint features are mapped one-to-one with the K participants;
an audio track extraction execution module, configured to perform audio track extraction on the target conference video and audio data to obtain target audio data;
a participant audio splitting module, configured to obtain K groups of participant audio, wherein the K groups of participant audio are obtained by performing sound source separation on the target audio data using the K participant voiceprint features;
a sound volume processing execution module, configured to preset a standard sound volume threshold and perform sound volume unification processing on the K groups of participant audio using the standard sound volume threshold to obtain K groups of sound volume unified audio;
a storage audio generation module, configured to obtain target storage audio data, wherein the target storage audio data is obtained by restoring the K groups of sound volume unified audio to their original time sequence;
a storage text obtaining module, configured to obtain target storage text data, wherein the target storage text data is obtained by performing text conversion on the target storage audio data;
a review authority setting module, configured to preset a conference data review authority and store the target storage audio data and the target storage text data to a target enterprise cloud space according to the conference data review authority;
wherein the voiceprint feature generation module is further configured to:
preset conference audio acquisition nodes, and perform audio acquisition on the target audio data based on the conference audio acquisition nodes to obtain M segments of local audio data, wherein the M segments of local audio data have M local audio acquisition nodes, M being a positive integer;
preset a voiceprint feature extraction rule, and traverse the M segments of local audio data based on the voiceprint feature extraction rule to obtain M groups of voiceprint feature parameters;
perform aggregation processing on the M groups of voiceprint feature parameters to obtain N participant voiceprint features and N feature acquisition nodes, N being a positive integer less than or equal to K;
judge whether the feature quantity of the N participant voiceprint features satisfies the K participants;
if the feature quantity of the N participant voiceprint features does not satisfy the K participants, perform secondary audio acquisition on the target audio data based on the conference audio acquisition nodes to obtain M segments of second local audio data, wherein the M segments of second local audio data have M second local audio acquisition nodes;
traverse the M segments of second local audio data based on the voiceprint feature extraction rule to obtain M groups of second voiceprint feature parameters;
perform aggregation processing on the M groups of voiceprint feature parameters and the M groups of second voiceprint feature parameters to obtain H participant voiceprint features and H feature acquisition nodes, H being a positive integer less than or equal to K;
judge whether the feature quantity of the H participant voiceprint features satisfies the K participants;
and so on, perform audio acquisition and voiceprint feature analysis on the target audio data multiple times based on the conference audio acquisition nodes until the feature quantity of the participant voiceprint features satisfies the participant quantity of the target enterprise conference;
if the feature quantity of the N participant voiceprint features satisfies the K participants, extract N segments of conference video information from the target conference video and audio data based on the N feature acquisition nodes;
perform behavior feature analysis on the N segments of conference video information, and locate N speaking participants;
and perform identity recognition on the N speaking participants, and construct an association mapping between the participant identities and the N participant voiceprint features to obtain the K participant voiceprint features.
CN202311258355.6A 2023-09-27 2023-09-27 Intelligent data storage method and device for enterprise conference Active CN116991808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311258355.6A CN116991808B (en) 2023-09-27 2023-09-27 Intelligent data storage method and device for enterprise conference

Publications (2)

Publication Number Publication Date
CN116991808A CN116991808A (en) 2023-11-03
CN116991808B true CN116991808B (en) 2023-12-08

Family

ID=88525242

Country Status (1)

CN (1) CN116991808B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175015A (en) * 2019-05-29 2019-08-27 Oppo广东移动通信有限公司 A kind of method, apparatus and terminal device of the volume of controlling terminal equipment
CN110298252A (en) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 Meeting summary generation method, device, computer equipment and storage medium
CN111883123A (en) * 2020-07-23 2020-11-03 平安科技(深圳)有限公司 AI identification-based conference summary generation method, device, equipment and medium
CN113920986A (en) * 2021-09-29 2022-01-11 中国平安人寿保险股份有限公司 Conference record generation method, device, equipment and storage medium
CN114900644A (en) * 2022-07-13 2022-08-12 杭州全能数字科技有限公司 Remote operation method and system for preset position of cloud platform camera in video conference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant