Reversing Steam Voice Codec

Discovering and reversing the Steam client’s voice codec to extract audio data for my Source Chat Relay project.

Prelude

Within TF2 and various other Source engine titles, game server operators may choose from a range of voice codecs for players to communicate with.

This is configured through the console variable sv_voicecodec, which accepts the following values:

vaudio_speex - Legacy Speex codec (lowest quality)
vaudio_celt - Newer CELT codec (22kHz, 22kbps)
vaudio_celt_high - Newer CELT codec, higher bitrate (44kHz, 44kbps)
steam - Use Steam voice API

I’m particularly interested in the steam codec, as it was made the default for TF2 a while back. With this option selected, each player’s voice data is encoded and decoded by the local Steam client.

Capturing data for analysis

Now that I have found the target, the next step is to capture a sample of data for analysis.

Within the Source engine, there exists an SV_BroadcastVoiceData static function that is used to broadcast voice data to all connected players.

Its decompiled signature looks something like this:

void __cdecl SV_BroadcastVoiceData(IClient *a1, int a2, char *a3, __int64 a4)

The two parameters I’m interested in are int a2 and char *a3, which are the buffer length and the buffer pointer, respectively.

Thankfully, the Linux and Mac builds of the binary contain symbols; after dumping them, I used the following mangled symbol for the detour: _Z21SV_BroadcastVoiceDataP7IClientiPcx.

With the detour on the static function enabled, every call dumps the buffer to a file that can be analyzed outside the game.

The resulting buffer data contains bytes like the following for each call.

**96 4B C6 11 01 00 10 01** <0B C0 5D 06> 4A 00 46 00
17 00 68 3E BE 82 95 DF 9C FA C2 8E 67 1A 54 EF
B4 E3 70 0A 90 35 4C 6B 2F 77 53 82 16 03 05 7C
61 98 AB EB DE EC BB E4 9C EB E9 B1 3E 0F E2 E5
45 81 46 E1 36 44 75 53 FF AC 49 6D 94 81 E4 54
25 39 CB 79 C2 C2 73 DD A7 34 A6 A7

**96 4B C6 11 01 00 10 01** <0B C0 5D 06> 50 00 4C 00
2D 00 68 3D C5 F6 28 6A 11 9C EE 04 D1 26 3A A9
6B D4 68 E4 7D BB 5F 6E 84 A5 7A 2E C5 06 7F 57
3A A5 24 5C 3F 47 FE 76 FE B1 FF BB F8 09 93 42
6B D3 F3 FC 21 F1 8C 5F DD DE D0 D1 58 43 1B 2D
1F 06 53 CB 2A B8 BC 77 F0 3C 9A 24 FF 67 32 56
D8 B0

Analyzing the dumps

After consecutive dumps, some patterns began to emerge from the bytes.

  • **bytes** change per player; read as a little-endian u64, the value is the player’s steamid64.
  • <bytes> remain constant across all the dumps.
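
To make the first observation concrete, here is a minimal sketch that reads the eight leading bytes of the first dump as a little-endian unsigned 64-bit integer. It uses only Python’s standard struct module; the variable names are made up for illustration.

```python
import struct

# First eight bytes of the first dump, copied from the hex listing above.
header = bytes.fromhex("96 4B C6 11 01 00 10 01")

# "<Q" = little-endian unsigned 64-bit integer.
(steamid64,) = struct.unpack("<Q", header)

print(steamid64)       # 76561198258473878
print(hex(steamid64))  # 0x110000111c64b96
```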

These seem to be all the patterns I could recognize from looking at the bytes alone. So, it’s disassembly time.

Blind static analysis

Since the steam codec is used, and the local Steam client does the encoding and decoding, the logical next step is to disassemble the Steam client itself.

In the Steam directory, it just happens to be conveniently called steamclient.dll, a 32-bit file, but unfortunately, it contains no symbols.

Without symbols to rely upon, string references for “opus” were used to find related subroutines.

Several interesting references were found:

test_opus_voice_encode -> sub_380A9600
Created OPUS PLC voice encoder\n -> sub_380B6D80

The former subroutine looks like a test command registration that calls sub_380BC660.

int __usercall sub_380BC660@<eax>(int a1@<edi>, int a2@<esi>, char *a3)
{
  // ...

  if ( !(unsigned __int8)sub_380BF840(a3, (int)&v12, (int)&v16, (int)&v15, (int)&v13, (int)&v14) )
    return Msg("Unable to read %s\n", a3);
  if ( v15 != 16 || v13 != 1 || v14 != 24000 )
    return Msg(
             "Wave file mismatch was got %d bits, %d channels, %d sample rate was expecting %d bits, %d channels, %d sample rate\n",
             v15,
             v13,
             v14,
             16,
             1,
             24000);
  v4 = sub_380C1540(1);
  v5 = v4;
  v11 = v4;
  if ( !v4 )
    return Msg("Couldn't create OPUS codec\n");
  if ( (*(unsigned __int8 (__thiscall **)(int, int, int))(*(_DWORD *)v4 + 4))(v4, 5, 24000) )
  {
    v10 = sub_385B6C30(v16);
    v6 = (*(int (__fastcall **)(int, int, int, int, int, int, int, int, int))(*(_DWORD *)v5 + 16))(
           v5,
           v16 % (v15 / 8),
           v12,
           v16 / (v15 / 8),
           v10,
           v16,
           1,
           a1,
           a2);
    Msg("Compressed %d source bytes to %d compressed bytes\n", v16, v6);
    v9 = (void *)sub_385B6C30(v16);
    v7 = 2 * (*(int (__thiscall **)(int, int, int, void *, int))(*(_DWORD *)v11 + 20))(v11, v10, v6, v9, v16);
    Msg("Uncompressed %d bytes to %d bytes\n", v6, v7);
    V_StripExtension(a3, FileName, 260);
    V_strncat(FileName, "_out.wav", 260, -1);
    result = sub_380C1010(FileName, v9, v7, 16, 1, 24000);
  }
  else
  {
    (*(void (__thiscall **)(int))(*(_DWORD *)v5 + 12))(v5);
    result = Msg("Couldn't init OPUS codec\n");
  }
  return result;
}

The encoder and decoder expect 16-bit, single-channel audio at a 24000 Hz sample rate. While this subroutine isn’t all that helpful for determining the structure of the Steam voice codec, it does confirm that Valve isn’t doing anything peculiar.
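
As a rough illustration of that format check, the sketch below uses Python’s standard wave module to read a WAV header and compare it with the expected 16-bit, mono, 24000 Hz layout. The helper name wav_format and the in-memory test file are invented for this example; nothing here is taken from steamclient.dll.

```python
import io
import wave

EXPECTED = (16, 1, 24000)  # bits per sample, channels, sample rate

def wav_format(fileobj):
    """Return (bits, channels, sample_rate, num_samples) of a PCM WAV file."""
    with wave.open(fileobj, "rb") as w:
        bits = w.getsampwidth() * 8
        return bits, w.getnchannels(), w.getframerate(), w.getnframes()

# Build a tiny in-memory WAV so the example runs without any external file.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                 # 2 bytes = 16 bits per sample
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 480)  # 480 samples = 20 ms at 24000 Hz
buf.seek(0)

bits, channels, rate, samples = wav_format(buf)
ok = (bits, channels, rate) == EXPECTED
print(ok, samples)  # True 480
```

The sample count is the byte size divided by the bytes per sample, which is the same v16 / (v15 / 8) arithmetic that shows up in the decompiled call.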

The latter subroutine is rather interesting. It claims to create an OPUS PLC voice encoder.

char __thiscall sub_380B6D80(_DWORD *this)
{
  // ... 
  switch ( v4 )
  {
    case 2:
    case 1:
      this[41] = 4;
      goto LABEL_15;
    case 4:
LABEL_15:
      v5 = sub_380C1C80();
      v7 = "Created SILK voice encoder\n";
      break;
    case 5:
      v5 = sub_380C1540();
      v7 = "Created OPUS voice encoder\n";
      break;
    case 6:
      v5 = sub_380C14F0();
      v7 = "Created OPUS PLC voice encoder\n";
      break;
    default:
      goto LABEL_17;
  }
  // ... 
}

This contains multiple codec references, including SILK, which is not my intended target. I take this sub to be a generic encoder init: each switch case calls a different sub, and every result ends up in the same register (EAX).

Looking back at the previously dumped bytes, the constant bytes include a 0x06. Perhaps it means OPUS PLC. That would fit: voice data is sent over an unreliable stream and is subject to data loss, which PLC (packet loss concealment) is meant to account for.
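
The switch can be restated as a small lookup. This is only a sketch of the decompiled control flow; the function name is invented, and the link between these ids and the 0x06 byte in the dumps is still a guess at this point.

```python
# Codec ids as they appear in the switch in sub_380B6D80.
ENCODERS = {4: "SILK", 5: "OPUS", 6: "OPUS PLC"}

def resolve_encoder(codec_id):
    """Return the encoder name the switch would pick, or None for the default case."""
    # Cases 1 and 2 overwrite the stored id with 4 and jump into the SILK case.
    if codec_id in (1, 2):
        codec_id = 4
    return ENCODERS.get(codec_id)

print(resolve_encoder(0x06))  # OPUS PLC
print(resolve_encoder(1))     # SILK
```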

The decompilation of the OPUS PLC setup sub:

int sub_380C14F0()
{
  // ...
  v0 = sub_385B6C30(72);
  result = 0;
  if ( v0 )
  {
    *(_DWORD *)v0 = &VoiceEncoder_OPUS::`vftable';
    *(_BYTE *)(v0 + 4) = 1;
    *(_DWORD *)(v0 + 8) = 0;
    *(_DWORD *)(v0 + 12) = 32000;
    *(_DWORD *)(v0 + 16) = 0;
    *(_WORD *)(v0 + 20) = 0;
    sub_388D11E0(0, 0, 0);
    *(_DWORD *)(v0 + 64) = 0;
    *(_WORD *)(v0 + 68) = 0;
    result = v0;
  }
  return result;
}

Aha, it looks like it points to a vtable named VoiceEncoder_OPUS, and that vtable doesn’t have many entries either!

.rdata:38C68438 ??_7VoiceEncoder_OPUS@@6B@ dd offset sub_380C11E0
.rdata:38C68438                                         ; DATA XREF: sub_380C11E0+9↑o
.rdata:38C68438                                         ; sub_380C14F0+19↑o ...
.rdata:38C6843C                 dd offset sub_380C1750
.rdata:38C68440                 dd offset sub_380C1740
.rdata:38C68444                 dd offset sub_380C1850
.rdata:38C68448                 dd offset sub_380C1250
.rdata:38C6844C                 dd offset sub_380C1590
.rdata:38C68450                 dd offset sub_380C1890
.rdata:38C68454                 dd offset sub_380C18F0

Entries 4 and 5 of this vtable (byte offsets 16 and 20) are the encoding and decoding methods. IDA was able to resolve one of the parameter names, potentially from some external calls, which makes identification easier.

int __thiscall sub_380C1250(char *this, void *Src, int a3, int a4, int a5, char a6) // 4
unsigned int __thiscall sub_380C1590(_WORD *this, int a2, int a3, void *a4, int a5) // 5
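
The mapping between vtable slots and the byte offsets seen in the earlier indirect calls (+ 16 and + 20) is plain pointer arithmetic. A small sketch, assuming 4-byte pointers since steamclient.dll is a 32-bit file; the helper names are made up:

```python
VTABLE_BASE = 0x38C68438  # address of the first entry in the .rdata listing
POINTER_SIZE = 4          # 32-bit binary, so each slot is 4 bytes wide

def slot_address(index):
    """Address of the vtable slot with the given zero-based index."""
    return VTABLE_BASE + index * POINTER_SIZE

def slot_index(byte_offset):
    """Zero-based slot index for a byte offset used in an indirect call."""
    return byte_offset // POINTER_SIZE

print(hex(slot_address(4)))            # 0x38c68448 -> sub_380C1250 (encode)
print(hex(slot_address(5)))            # 0x38c6844c -> sub_380C1590 (decode)
print(slot_index(16), slot_index(20))  # 4 5
```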

Since IDA is not able to completely decompile the virtual function calls, it is much easier to attach a debugger, set breakpoints in either virtual function, and analyze the call stack.

Analyzing call stack

With the local Windows debugger attached, a breakpoint set inside the virtual decoder function, and another client sending voice, the following call stack shows up.

1. sub_380C1590
2. sub_380BBDE0+0x2B9A
3. sub_380BC1F0+0x305B
4. sub_380B6A50+0x4FE8
5. sub_380B2050+0x5CC

sub_380B2050 computes a CRC32 checksum over the packet from its start up to, but not including, the last 4 bytes, and compares it with the CRC32 stored in those last 4 bytes. The remaining conditions appear to sanity-check the leading 64-bit value.

int __thiscall sub_380B2050(_DWORD *this, int a2, unsigned int a3, int a4, int a5, _DWORD *a6, int a7)
{
  // ...
  v9 = sub_388D70A0(a2, a3 - 4); // <-- CRC32
  sub_388D3570(2, 4);
  if ( sub_388D2130(v16) == v9
    && (sub_388D3570(0, 0),
        v10 = sub_388D2040(v16),
        v11 = HIDWORD(v10),
        v12 = v10,
        (v13 = (HIDWORD(v10) >> 20) & 0xF) != 0)
    && v13 < 0xB
    && SHIDWORD(v10) >> 24 > 0
    && SHIDWORD(v10) >> 24 < 5
    && (v13 != 1 || (_DWORD)v10 && (HIDWORD(v10) & 0xFFFFF) == 1)
    && (v13 != 7 || (_DWORD)v10 && (v10 & 0xFFFFF00000000i64) == 0)
    && ((_DWORD)v10 || v13 != 3) )
  // ...
}
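
A minimal sketch of the same trailer check, under two assumptions that the decompilation alone does not confirm: that sub_388D70A0 is the common CRC-32 also implemented by Python’s zlib.crc32, and that the stored value is little-endian. The packet below is synthetic, and the function name is invented.

```python
import struct
import zlib

def crc_ok(packet: bytes) -> bool:
    """Compare CRC-32 of everything but the last 4 bytes with the stored trailer."""
    if len(packet) < 4:
        return False
    (stored,) = struct.unpack("<I", packet[-4:])  # assumed little-endian
    return (zlib.crc32(packet[:-4]) & 0xFFFFFFFF) == stored

# Synthetic packet: some leading bytes plus a freshly computed trailer.
body = bytes.fromhex("96 4B C6 11 01 00 10 01 0B C0 5D")
good = body + struct.pack("<I", zlib.crc32(body))
bad = good[:-1] + bytes([good[-1] ^ 0xFF])  # corrupt the last trailer byte

print(crc_ok(good), crc_ok(bad))  # True False
```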

sub_380BC1F0 contains the parsing of the individual payload types.

int __thiscall sub_380BC1F0(int this, int a2, int a3, char a4, int a5, int a6, int a7)
{
  // ... 
  if ( v40 < v41 )
  {
    while ( 2 )
    {
      v10 = sub_388D1790(v38);
      v11 = v10;
      switch ( v10 )
      {
        case 0:
          v13 = (unsigned __int16)sub_388D21D0(v38);
          // ...
        case 1:
        case 3:
        case 4:
        case 5:
        case 6:
          v42 = *(_DWORD *)(this + 48112);
          // ...
          v31 = (unsigned __int16)sub_388D21D0(v38);
          if ( v31 > v41 - v40 )
          {
            sub_380B7A70("bad voice payload", &a4, 1);
            *(_DWORD *)(this + 48112) = v42;
            sub_388D64B0(v38);
            result = 5;
          }
          else
          {
            if ( sub_380BBDE0(this, v30, (void *)(v40 + v39), v31, v11, a6) )
            {
              sub_388D3570(1, v31);
              if ( !*(_BYTE *)(this + 48123) || *(_BYTE *)(this + 48121) )
              {
                v32 = sub_380B1D00(v35, "Received voice data from remote end (%d samples)\n", v31 >> 1);
                sub_380B7A70(v32, &a4, 0);
                *(_BYTE *)(this + 48123) = 1;
              }
              *(_DWORD *)(this + 48112) = v42;
              goto LABEL_44;
            }
            *(_DWORD *)(this + 48112) = v42;
            sub_388D64B0(v38);
            result = 1;
          }
          return result;
          // ...
        case 11:
          v12 = sub_388D21D0(v38);
          // ...
      }
    }
  }
  sub_388D64B0(v38);
  return 0;
}

From this parsing logic, the following payload structures can be deduced.

0:
  u16 num_samples;
1, 2, 3, 4, 5, 6:
  u16 byte_size;
  [u8] of len byte_size;
11:
  u16 sample_rate; (48000, 44100, 32000, 24000, 16000, 12000, 8000)
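
The structures above translate into a short parsing loop. This is a sketch rather than a port of sub_380BC1F0: it assumes the type tag is a single byte and all u16 values are little-endian, which is what the dumps suggest, and the record names and sample input are invented.

```python
import struct

VOICE_TYPES = {1, 2, 3, 4, 5, 6}

def iter_payloads(body: bytes):
    """Yield one tuple per payload found in the bytes between the id and the CRC."""
    pos = 0
    while pos < len(body):
        ptype = body[pos]
        pos += 1
        if pos + 2 > len(body):
            raise ValueError("truncated payload header")
        (value,) = struct.unpack_from("<H", body, pos)
        pos += 2
        if ptype == 0:
            yield ("num_samples", value)
        elif ptype == 11:
            yield ("sample_rate", value)
        elif ptype in VOICE_TYPES:
            # Same bounds check that triggers "bad voice payload" in the sub.
            if value > len(body) - pos:
                raise ValueError("bad voice payload")
            yield ("voice", ptype, body[pos:pos + value])
            pos += value
        else:
            raise ValueError(f"unknown payload type {ptype}")

# Synthetic body: sample rate 24000, a 3-byte type-6 payload, then type 0.
body = bytes.fromhex("0B C0 5D 06 03 00 AA BB CC 00 E0 01")
records = list(iter_payloads(body))
print(records)
```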

Aggregating & applying all the information

Taking the same two dumps from earlier and applying the discoveries gives the following.

**96 4B C6 11 01 00 10 01** <0B C0 5D> [06 4A 00 46 00
17 00 68 3E BE 82 95 DF 9C FA C2 8E 67 1A 54 EF
B4 E3 70 0A 90 35 4C 6B 2F 77 53 82 16 03 05 7C
61 98 AB EB DE EC BB E4 9C EB E9 B1 3E 0F E2 E5
45 81 46 E1 36 44 75 53 FF AC 49 6D 94 81 E4 54
25 39 CB 79 C2 C2 73 DD] {A7 34 A6 A7}

**96 4B C6 11 01 00 10 01** <0B C0 5D> [06 50 00 4C 00
2D 00 68 3D C5 F6 28 6A 11 9C EE 04 D1 26 3A A9
6B D4 68 E4 7D BB 5F 6E 84 A5 7A 2E C5 06 7F 57
3A A5 24 5C 3F 47 FE 76 FE B1 FF BB F8 09 93 42
6B D3 F3 FC 21 F1 8C 5F DD DE D0 D1 58 43 1B 2D
1F 06 53 CB 2A B8 BC 77 F0 3C 9A 24 FF 67] {32 56
D8 B0}

  • **bytes** - change per player; read as a little-endian u64, the value is the player’s steamid64.
  • <bytes> - 0x0B (11) payload type, followed by a u16 sample rate (0x5DC0 = 24000).
  • [bytes] - 0x06 payload type, followed by a u16 byte length and then that many bytes of OPUS PLC data.
  • {bytes} - CRC32 checksum.
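
As a check on the annotated layout, the sketch below slices the first dump at fixed offsets. The offsets only hold for packets shaped like this one (a sample-rate payload followed by a single voice payload), and the CRC is read but not verified here.

```python
import struct

# First dump, copied from the hex listing above.
DUMP = bytes.fromhex("""
96 4B C6 11 01 00 10 01 0B C0 5D 06 4A 00 46 00
17 00 68 3E BE 82 95 DF 9C FA C2 8E 67 1A 54 EF
B4 E3 70 0A 90 35 4C 6B 2F 77 53 82 16 03 05 7C
61 98 AB EB DE EC BB E4 9C EB E9 B1 3E 0F E2 E5
45 81 46 E1 36 44 75 53 FF AC 49 6D 94 81 E4 54
25 39 CB 79 C2 C2 73 DD A7 34 A6 A7
""")

(steamid64,) = struct.unpack_from("<Q", DUMP, 0)              # **bytes**
rate_type, sample_rate = struct.unpack_from("<BH", DUMP, 8)   # <bytes>
voice_type, voice_len = struct.unpack_from("<BH", DUMP, 11)   # [bytes] header
voice = DUMP[14:14 + voice_len]                               # [bytes] data
(crc,) = struct.unpack_from("<I", DUMP, 14 + voice_len)       # {bytes}

print(steamid64)                    # 76561198258473878
print(rate_type, sample_rate)       # 11 24000
print(voice_type, voice_len)        # 6 74
print(hex(crc))                     # 0xa7a634a7
print(14 + voice_len + 4 == len(DUMP))  # True
```

The 74 bytes in voice are what would be handed to the decoder.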

From here, the packet format is fully mapped out, and we can begin decoding the OPUS PLC data!

Credits

Special thanks to SlidyBat and asherkin on AlliedModders for guiding me with disassembly and for the initial voice work to base this on.