KiWi's Blog
UE Voice Pipeline

VGT Voice Management System Technical Details

Introduction

In our day-to-day work, we have found that configuration and testing are significant pain points for game voice. In our environment, a single voice line requires coordination from multiple parties on its way from resource production to final in-game integration: planning provides requirements → audio design provides resources → audio design configures Wwise → programming integrates it into the game → planning verifies the result.

This process is lengthy, and every adjustment requires participation from multiple departments, which hurts iteration efficiency. In addition, while Wwise is powerful, configuring complex playback logic in it is still cumbersome, especially for features such as sequential playback and conditional filtering.

We therefore developed our own voice pipeline for Unreal Engine, called "VGT". It still relies on Wwise for decoding and playback (Wwise's audio processing capabilities remain hard to beat), but we completely redesigned resource management and playback-logic control on top of it, and added several practical supporting systems, including Blueprint support, a callback system, a sequence system, and an audio-visual synchronization system. This pipeline not only resolves the collaboration issues described above, but also frees us from complex Wwise project configuration while improving resource management and loading efficiency.

Core System Modules

UI Module

VGT provides a UI to help users quickly import resources and test playback. Since planners are more accustomed to Excel, we implemented an Excel data-driven configuration workflow. Imported .wav voice files are converted to the .wem format using Wwise's transcoding tool.

[Screenshot: VGT import and configuration interface]

As the interface shows, the whole workflow is intuitive: planners can import, configure, and test on their own, without waiting for support from programmers or audio designers.

Resource Management Module

The sheer volume of voice content in a large game is a real problem: with tens of thousands of voice lines, loading everything into memory is clearly unrealistic. We took inspiration from Wwise's SoundBank concept and designed a "Module" architecture.

Each Module acts like an independent voice package that can be dynamically loaded and unloaded based on game needs. For example, the voice for a specific level can be packaged into a Module, loaded when entering the level, and unloaded when leaving. The data structure of the Module is defined as follows:

struct VoiceModuleData
{
    FString moduleName;
    FString modulePath;

    int32 maxPlayedEventInstanceNum;
    int32 maxSavedEventInfoNum;

    // Playable voices in this module
    TMap<FString, TArray<TSharedPtr<EventData>>> eventMap;
    // Container type for each voice in this module (random container, sequential container, ...)
    TMap<FString, int32> eventTypeMap;
};
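The load-on-enter / unload-on-leave lifecycle described above can be sketched as follows. This is a minimal standalone sketch in standard C++ rather than UE types; `ModuleManager` and its method names are illustrative assumptions, not the actual VGT API.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for the real VoiceModuleData above.
struct VoiceModule { std::string name; /* eventMap, eventTypeMap, ... */ };

class ModuleManager {
public:
    bool IsLoaded(const std::string& name) const { return modules.count(name) > 0; }
    // Load a voice package on demand (e.g. when entering a level).
    void LoadModule(const std::string& name) {
        if (!IsLoaded(name)) modules[name] = std::make_unique<VoiceModule>(VoiceModule{name});
    }
    // Unload it to release memory (e.g. when leaving the level).
    void UnloadModule(const std::string& name) { modules.erase(name); }
private:
    std::unordered_map<std::string, std::unique_ptr<VoiceModule>> modules;
};
```

In a real implementation, LoadModule would deserialize the Module's event data from disk rather than construct an empty object.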

For locally transcoded wem voice resources, we assign a unique ID to each file to avoid performance loss caused by excessive string operations. Additionally, to quickly locate file paths, we designed a path management system based on an AVL tree:

[Diagram: AVL-tree-based path management structure]

The benefit of this design is that when we need to find a file by its voice ID, we can quickly locate the corresponding directory path, with O(log n) lookup cost.
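As an illustration of the ID-to-path lookup, here is a minimal standalone sketch. It uses `std::map` (a red-black tree) in place of the AVL tree described above; both are balanced BSTs with O(log n) lookups. All names here are illustrative, not VGT's actual types.

```cpp
#include <map>
#include <string>

// Maps a resource ID to the directory path of its transcoded .wem file.
class VoicePathRegistry {
public:
    void Register(int resourceId, const std::string& dirPath) {
        pathById[resourceId] = dirPath;
    }
    // Returns an empty string when the ID is unknown.
    std::string Resolve(int resourceId) const {
        auto it = pathById.find(resourceId);
        return it == pathById.end() ? std::string() : it->second;
    }
private:
    std::map<int, std::string> pathById; // balanced BST -> O(log n) lookups
};
```

Integer keys also avoid the repeated string hashing/comparison cost the text mentions.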

For managing Event information, we defined the following data structure:

struct EventInfo
{
    int32 ResourceID;          // Unique ID allocated for the resource
    int32 Order;               // Ordering parameter for sequential playback
    int32 Probability;         // Weight for random playback
    int32 AdditionalFieldsNum; // Number of additional filtering conditions
};

Event information will be stored in the corresponding Module directory and loaded as needed.

Resource Playback Module

To reduce memory pressure, we adopted a Streaming playback method. This way, we do not need to preload all voice files into memory.

The underlying playback implementation is still based on Wwise's External Source mechanism:

// Construct externalSource based on audio path
TArray<AkOSChar> FullPath;
FullPath.Reserve(fullPath.Len() + 1);
FullPath.Append(TCHAR_TO_AK(*fullPath), fullPath.Len() + 1);
AkExternalSourceInfo externalSourceInfo(FullPath.GetData(), externalSrcCookie, AKCODECID_VORBIS);
TArray<AkExternalSourceInfo> externalSources;
externalSources.Add(externalSourceInfo);

// Playback
playingID = AkAudioDevice->PostEventOnGameObjectID(eventID, akGameObjectID, AK_EndOfEvent | AK_EnableGetSourcePlayPosition, &FVoicePlayer::OnEventCallback, Data.Get(), externalSources);

However, the key point is that on the upper layer, we completely implemented our own playback logic control system. This system essentially replicates the functionality of Wwise's Random Container and Sequence Container but is more flexible and easier to extend.

We implemented RandomContainer (random playback container), SequenceContainer (sequential playback container), and SingleContainer (single playback container). The system determines which playback strategy to use based on the EventType stored in the Module.

Taking RandomContainer as an example, we implemented a full weighted-random selection algorithm, plus a mechanism to avoid repeating lines that have already been played:

float TotalPercentage = 0.0f;
for (const TSharedPtr<RandomEventConfig>& Config : EventConfigs)
{
    TotalPercentage += Config->Probability;
}
// If the weights do not sum to 100, normalize them
if (!FMath::IsNearlyEqual(TotalPercentage, 100.0f))
{
    if (FMath::IsNearlyZero(TotalPercentage))
    {
        // All weights are 0: distribute evenly
        for (const TSharedPtr<RandomEventConfig>& Config : EventConfigs)
        {
            Config->Probability = 100.0f / EventConfigs.Num();
        }
    }
    else
    {
        // Otherwise, rescale proportionally
        for (const TSharedPtr<RandomEventConfig>& Config : EventConfigs)
        {
            Config->Probability = (Config->Probability / TotalPercentage) * 100.0f;
        }
    }
}

// Collect the voices that are still eligible (excluding those already played)
float AvailableTotalPercentage = 0.0f;
TArray<TSharedPtr<RandomEventConfig>> AvailableConfigs;
for (int32 i = 0; i < EventConfigs.Num(); ++i)
{
    if (EventConfigs[i]->bShouldReplaced)
    {
        continue;
    }
    AvailableTotalPercentage += EventConfigs[i]->Probability;
    AvailableConfigs.Add(EventConfigs[i]);
}

// Every voice has been played: reset the play set and retry
if (AvailableTotalPercentage <= 0.0f)
{
    ForceRefreshPlaySet();
    return PlayVoice(akGameObjectID, trackID, percent, callback, pParam);
}

// Pick by accumulated weight
const float RandomNumber = FMath::RandRange(0.0f, AvailableTotalPercentage);
float AccumulatedPercentage = 0.0f;
for (const TSharedPtr<RandomEventConfig>& Config : AvailableConfigs)
{
    AccumulatedPercentage += Config->Probability;
    if (RandomNumber <= AccumulatedPercentage)
    {
        // AvailableConfigs already excludes played entries, so play directly
        PlayVoice();
        break;
    }
}

The implementation of SequenceContainer follows a similar approach, maintaining a playback index to play voices in order, supporting looping, jumping, and other functions.

To make integration on the program side easier, we also designed a unified callback system. It centrally handles Wwise's callback events and triggers the appropriate callback functions based on context, supporting both C++ and Blueprints:

[Diagram: unified callback system flow]
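The central-dispatch idea can be sketched as follows: one router maps Wwise playing IDs to user callbacks and cleans up one-shot entries on end-of-event. This is a simplified standalone sketch; the types and names are illustrative, not VGT's actual API.

```cpp
#include <functional>
#include <unordered_map>

enum class EVoiceEvent { EndOfEvent, Marker };
using FVoiceCallback = std::function<void(EVoiceEvent)>;

class CallbackRouter {
public:
    void Register(unsigned playingId, FVoiceCallback cb) {
        callbacks[playingId] = std::move(cb);
    }
    // Called from the single Wwise callback entry point
    // (e.g. FVoicePlayer::OnEventCallback in the code above).
    void Dispatch(unsigned playingId, EVoiceEvent ev) {
        auto it = callbacks.find(playingId);
        if (it == callbacks.end()) return;
        it->second(ev);
        if (ev == EVoiceEvent::EndOfEvent) callbacks.erase(it); // one-shot cleanup
    }
private:
    std::unordered_map<unsigned, FVoiceCallback> callbacks;
};
```

A Blueprint-facing layer would wrap the same router with dynamic multicast delegates.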

Performance Optimization and Thread Safety

In terms of performance optimization, we made several key improvements:

First is lazy loading: Module information is not loaded into memory up front, but only when it is actually needed.

Second is data-format optimization. Each Module carries a large amount of Event information, and JSON deserialization proved to be a real performance bottleneck. We therefore convert the JSON to a binary format at packaging time and map it into virtual memory at runtime, parsing only each Event's offset within the data and accessing entries directly when needed:

GetEvent(eventName) -> LoadEventOffsetMap() -> LoadBinaryEventConfigMap():

TUniquePtr<FArchive> DataFileReader(IFileManager::Get().CreateFileReader(*EventConfigPath));
DataFileReader->Seek(EventDataOffsetMap[eventName]);
ResultConfig->LoadBinary(DataFileReader);

return ResultConfig;

This can greatly reduce memory usage and loading time.

For thread safety, we adopted a modular design, where each thread has its own VoicePlayerInstance, reducing lock contention:

[Diagram: per-thread VoicePlayerInstance design]

Audio-Visual Synchronization and Sequence Support

In addition to basic playback functionality, we also implemented an audio-visual synchronization system. This feature is mainly to solve the precise synchronization problem between voice and animations, UI, etc., which is particularly useful in cutscenes.

We integrated UE's Sequencer system, achieving millisecond-level synchronization accuracy by automatically calculating voice duration. The system will automatically analyze the voice file to obtain precise duration and then implement synchronization checks every frame through a Ticker mechanism.
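The per-frame check reduces to comparing the voice playback position against the sequencer time within a tolerance. A minimal sketch (the 5 ms tolerance is an illustrative assumption, not VGT's actual threshold):

```cpp
#include <cmath>

// Returns true when the voice position and sequencer time agree
// within the given tolerance (seconds).
bool IsInSync(double voicePositionSec, double sequencerTimeSec,
              double toleranceSec = 0.005) {
    return std::fabs(voicePositionSec - sequencerTimeSec) <= toleranceSec;
}
```

In the real system the voice position would come from Wwise's GetSourcePlayPosition (enabled by the AK_EnableGetSourcePlayPosition flag shown earlier), and a drift beyond tolerance would trigger a correction.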

We also support the concept of a TrackID, which makes it easier to manage and control separate audio tracks and is particularly useful in complex dialogue systems.

Conclusion

The core idea of the VGT system is to redesign the upper-layer playback logic and resource management on top of Wwise, making the whole voice system more flexible and efficient. Some modules (such as Blueprint support and multi-language management) are not covered in detail here, but the core architecture and design ideas are as described above.

The greatest value of the system is that planners can now complete voice configuration independently, while programmers gain more flexible playback control. Performance and stability also received close attention throughout, since a game's voice system has to be dependable.
