Training kaldi models with custom features

Kaldi Speech Recognition Toolkit is a freely available toolkit that offers several tools for conducting research on automatic speech recognition (ASR). It lets us train an ASR system from scratch, all the way from feature extraction (MFCC, FBANK, ivector, FMLLR, …) through GMM and DNN acoustic model training to decoding with advanced language models, and it can produce state-of-the-art results.

While kaldi offers a lot of flexibility at every stage, sometimes we also need to experiment with features that are not offered by the kaldi repository. Kaldi uses the ark format to store features, so if we want to run experiments with customized features, they must first be converted to the ark format. The goal of this post is to explain how we can extract custom features and store them in the ark format using MATLAB and python.

First Steps

In this tutorial, we will start with a data folder generated by kaldi, in order to avoid errors from the directory validation that kaldi performs before training any model. So we will first run the kaldi routine that prepares the data folders (typically done by the local/<datasetname> script). This generates all the required files, such as text containing the transcriptions, utt2spk and spk2utt for CMVN and ivector computation, and wav.scp for reading the audio files from disk. Then we will run the feature extraction pipeline in the default kaldi way, say for FBANK features. Typically, the data-fbank folder contains the train, test and dev folders, each with its text, utt2spk, spk2utt and wav.scp files.

Running the feature extraction will typically store the features in ark format in the fbank folder, and the corresponding feats.scp files (used for reading the ark files) will be placed in the respective subfolders of data-fbank. In order to run the feature extraction in parallel, kaldi typically splits the data into 10 sets and extracts them in parallel. Thus the directory structure will typically be:

├── raw_fbank_dev.1.ark
├── raw_fbank_dev.1.scp
├── ...
├── ...
├── raw_fbank_dev.10.ark
├── raw_fbank_dev.10.scp
├── raw_fbank_test.1.ark
├── raw_fbank_test.1.scp
├── ...
├── ...
├── raw_fbank_test.10.ark
├── raw_fbank_test.10.scp
├── raw_fbank_train.1.ark
├── raw_fbank_train.1.scp
├── ...
├── ...
├── raw_fbank_train.10.ark
└── raw_fbank_train.10.scp

├── dev
│   ├── feats.scp
│   ├── spk2gender
│   ├── spk2utt
│   ├── stm
│   ├── text
│   ├── utt2spk
│   ├── wav.scp
│   └── split10
├── test
│   ├── feats.scp
│   ├── spk2gender
│   ├── spk2utt
│   ├── text
│   ├── utt2spk
│   ├── wav.scp
│   └── split10
└── train
    ├── feats.scp
    ├── spk2gender
    ├── spk2utt
    ├── split10
    ├── text
    ├── train
    ├── utt2spk
    └── wav.scp

To summarize the steps so far:

# run the data preparation
bash local/<database> 
# extract fbank features
mkdir fbank data-fbank
for x in train test dev ; do
    cp -r data/$x data-fbank/
    steps/ --nj 10  --fbank-config conf/ \
      data-fbank/$x exp/make_fbank/$x fbank || exit 1;
done

Notice that feats.scp is a concatenation of all the .scp files belonging to the subset (train, dev or test). Each .scp file has two columns: the first column contains the unique id of the utterance and the second column says where the corresponding features are located in the ark file. This tutorial will generate new ark and scp files so that kaldi can use them for training.
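
To make the two-column layout concrete, here is a minimal sketch that parses scp lines into a lookup table. It assumes the usual form of the second column for feature archives, `<ark path>:<byte offset>`. The helper names `parse_scp_line` and `read_scp` and the example lines are made up for illustration and are not part of kaldi.

```python
from typing import Dict, Iterable, Tuple


def parse_scp_line(line: str) -> Tuple[str, str, int]:
    """Split one scp line into (utterance id, ark path, byte offset)."""
    utt_id, location = line.strip().split(maxsplit=1)
    # Split on the last colon so that paths containing colons still parse.
    ark_path, offset = location.rsplit(":", 1)
    return utt_id, ark_path, int(offset)


def read_scp(lines: Iterable[str]) -> Dict[str, Tuple[str, int]]:
    """Map each utterance id to its (ark path, byte offset) pair."""
    table = {}
    for line in lines:
        if line.strip():  # ignore empty lines
            utt_id, ark_path, offset = parse_scp_line(line)
            table[utt_id] = (ark_path, offset)
    return table


# Made-up example lines, only to show the layout.
example = [
    "utt001 /home/user/fbank/raw_fbank_train.1.ark:7",
    "utt002 /home/user/fbank/raw_fbank_train.1.ark:20519",
]
index = read_scp(example)
print(index["utt002"])
```

Because every entry carries the path of its ark file, the per-split scp files can simply be concatenated into one feats.scp.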

The next step is to create the myfeats and data-myfeats directories. These directories follow the same structure as the fbank and data-fbank directories, but contain our custom features. Notice that the data folder will have the same files except for feats.scp, which has to be regenerated so that it tells kaldi to read the features from the myfeats folder. The steps are summarized below:

$ mkdir myfeats data-myfeats     
$ cp -r data-fbank/* data-myfeats       
$ rm data-myfeats/*/feats.scp  # we will need to generate new feats.scp files for the custom feats

Now the prerequisites are done. Next, we can use either MATLAB or python to prepare the custom features and generate the corresponding ark and scp files.

If you prefer python, skip ahead to the section "Custom features using Python".

Notice that if you want to reuse alignments obtained with, say, MFCC features, the number of frames in the new features must match that of the MFCC features. That is, the custom feature extraction must use the same window size and window shift. Otherwise, we will end up with wrong alignments and the (DNN or GMM) training will fail.
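
As a rough check of this requirement, the sketch below computes how many frames a signal yields for a given window size and shift. It assumes that only frames fitting entirely inside the signal are kept, which is how kaldi behaves with its default --snip-edges=true setting, and it uses 25 ms and 10 ms as default values. The function name `num_frames` is made up for illustration.

```python
def num_frames(num_samples: int, sample_rate: int,
               frame_length_ms: float = 25.0,
               frame_shift_ms: float = 10.0) -> int:
    """Number of complete frames that fit into a signal of num_samples."""
    window = int(sample_rate * frame_length_ms / 1000)  # samples per frame
    shift = int(sample_rate * frame_shift_ms / 1000)    # samples between frame starts
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


# One second of audio at 16 kHz with a 25 ms window and a 10 ms shift.
print(num_frames(16000, 16000))
```

If your own extractor pads the signal or keeps partial frames, its frame count will differ from this, and the features will no longer line up with the existing alignments.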

Custom features using MATLAB

Since kaldi makes use of a unique ID to read the ark files, we will need to follow the same format. As mentioned before, we will use the already prepared FBANK features as a reference file to generate our custom features. In order to do this with MATLAB, we will need to read and write the ark files using MATLAB.

The research page of Hynek Boril offers matlab routines for reading and writing ark files. The codes can be found here. It offers three MATLAB files:

  1. arkread.m : For reading ark files
  2. arkwrite.m : For writing ark files
  3. ark2scp.m : For generating scp files from ark files

So the general procedure is as follows: use arkread.m to read the ark files, which returns a header mat and a feature mat. The header mat contains the unique id of each utterance and the feature size related information. The feature mat is a huge matrix that contains all the feature vectors. The idea is to replace the feature mat with our custom features, update the header mat accordingly, and write both using arkwrite.m. Then we use ark2scp.m to generate the corresponding .scp file.
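
The bookkeeping behind the header mat and the feature mat can be sketched in a few lines of Python. The five-column header layout used here (id, number of frames, feature dimension, first row, last row, with 1-based row numbers) is inferred from the sample MATLAB code in this section and is not taken from the documentation of arkread.m. The function name `stack_features` is made up for illustration.

```python
def stack_features(feats_by_utt):
    """Stack per-utterance features and build the matching header rows.

    feats_by_utt is a list of (utt_id, frames) pairs, where frames is a
    list of equal-length feature vectors. Each header row is
    (utt_id, num_frames, feat_dim, first_row, last_row), using 1-based
    inclusive row numbers into the stacked matrix.
    """
    header, stacked = [], []
    row_start = 1
    for utt_id, frames in feats_by_utt:
        num = len(frames)
        dim = len(frames[0])
        row_end = row_start + num - 1
        header.append((utt_id, num, dim, row_start, row_end))
        stacked.extend(frames)   # append this utterance's frames to the big matrix
        row_start = row_end + 1  # the next utterance starts on the following row
    return header, stacked


header, stacked = stack_features([
    ("utt001", [[0.1, 0.2], [0.3, 0.4]]),
    ("utt002", [[0.5, 0.6]]),
])
for row in header:
    print(row)
```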

A sample MATLAB code is given below.

setlist = {'test', 'train', 'dev'};
myfeatname = 'myfeats';
splits = 10; % the feature extraction is typically split into 10, change this otherwise
splitstart = 1; % first split to process; raise it to resume an interrupted run
featdim = 64; % insert your feature dim here
codedir = pwd; % directory that holds the fbank and data-fbank folders

fbankdatadir = fullfile(codedir, 'data-fbank');
fbankfeatdir = fullfile(codedir, 'fbank');

datadir = fullfile(codedir, ['data-', myfeatname]);
featdir = fullfile(codedir, myfeatname);

% create dirs
system(['mkdir -p ' datadir ' ' featdir]);

for setnum = 1:length(setlist)
    setname = setlist{setnum};
    % copy files
    system(['cp -r ' fbankdatadir '/' setname '  ' datadir '/']);

    filelist = fullfile(datadir, setname, 'wav.scp');

    % read the ids (first column) and the audio paths (second column)
    fid = fopen(filelist, 'r');
    filenames = textscan(fid, '%s %s');
    fclose(fid);
    numfiles = length(filenames{1});

    fileindex = 1;

    for splitindex = 1:splits
        disp(['Processing ' num2str(splitindex) ' of ' num2str(splits) ' split data.']);

        srcark = strcat('raw_fbank_', setname, '.', num2str(splitindex), '.ark');
        srcark_filename = fullfile(fbankfeatdir, srcark);

        destark = strcat('raw_', myfeatname, '_', setname, '.', num2str(splitindex), '.ark');
        destscp = strcat('raw_', myfeatname, '_', setname, '.', num2str(splitindex), '.scp');
        destark_filename = fullfile(featdir, destark);
        destscp_filename = fullfile(featdir, destscp);

        [HEADER_MAT, FEATURE_MAT] = arkread(srcark_filename);

        HEADER_MAT_NEW = cell(size(HEADER_MAT,1), 5);
        FEATURE_MAT_NEW = zeros(size(FEATURE_MAT,1), featdim); % expects same number of frames as FBANK

        framestart = 1;

        % skip splits before splitstart, but keep the wav.scp index in sync
        if lt(splitindex, splitstart)
            fileindex = fileindex + size(HEADER_MAT, 1);
            continue;
        end

        for filenum = 1 : size(HEADER_MAT, 1)

            disp(['File index is ' num2str(fileindex) ' of split set ' num2str(splitindex)])

            % verify the id
            if ~strcmp(HEADER_MAT{filenum,1}, filenames{1}{fileindex})
                error('ID mismatch!')
            end

            % read the audio file, the filename is typically the second column in wav.scp
            % else, find the right column index in wav.scp and change the cell column below
            [x, fs] = audioread(filenames{2}{fileindex});

            % Extract custom features here (extract_my_feats is your own function)
            newfeat = extract_my_feats(x);
            % expected shape of newfeat is (numframes x featdim)

            % check if it has the same number of frames as in FBANK
            % else, it is an error if we need to reuse alignments
            % if you dont need to use existing alignments
            % comment out this check
            if ne(HEADER_MAT{filenum,2}, size(newfeat,1))
                error('Dimension mismatch!')
            end

            frameend = framestart + size(newfeat,1) - 1;
            % fill in header mat details
            HEADER_MAT_NEW{filenum, 1} = HEADER_MAT{filenum,1};
            HEADER_MAT_NEW{filenum, 2} = size(newfeat,1);
            HEADER_MAT_NEW{filenum, 3} = featdim;
            HEADER_MAT_NEW{filenum, 4} = framestart;
            HEADER_MAT_NEW{filenum, 5} = frameend;
            % add features to featuremat
            FEATURE_MAT_NEW(framestart:frameend,:) = newfeat;

            fileindex = fileindex + 1;
            framestart = frameend + 1;
        end % filenum

        % write the ark file
        disp('Writing ark files...')
        arkwrite(destark_filename, HEADER_MAT_NEW, FEATURE_MAT_NEW);
        % then create destscp_filename from this ark with ark2scp.m
        % (see the header of ark2scp.m for its arguments)

    end % splitindex

end % set num

Now the features are stored in ark files and the corresponding scp files are generated. Next, we have to generate the feats.scp file of each subset by concatenating the scp files of its 10 splits. You can do it in a terminal using:

nj=10         # number of splits
feat=myfeats  # folder that holds the new ark and scp files
for x in test train dev; do
    rm -f data-myfeats/$x/feats.scp
    for n in $(seq $nj); do
      cat $feat/raw_myfeats_${x}.$n.scp || exit 1;
    done > data-myfeats/$x/feats.scp
done

Now you can change the data directory in the training scripts to data-myfeats to train and decode with kaldi using your custom features.

Custom features using Python

As mentioned in the section above, since kaldi makes use of a unique ID to read the ark files, we will need to follow the same format. For this we will use the already prepared FBANK features as a reference to generate our custom features. In order to do this with python, we need python routines to read and write ark files.

kaldiio is a pure python module for reading and writing kaldi ark files. You can install it on your machine or in your environment using pip.

pip install kaldiio

We will make use of the load_ark and save_ark functions. The general procedure is as follows: use load_ark to read the ark files, which yields each unique id together with the corresponding feature matrix. The idea is to replace the feature matrix with our custom features, pair it with the same unique id, and write it using save_ark. We will also use save_ark to generate the corresponding .scp file.
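
The sample code in this section leaves the feature extraction itself to a function called extract_my_feats, which you have to supply. Purely as an illustration of the expected (numframes x featdim) output, here is a toy version that computes one log-energy value per frame. The 25 ms window, 10 ms shift, 16 kHz default sampling rate and the rule of dropping incomplete frames are assumptions chosen to match common FBANK settings; replace them with whatever your reference features use.

```python
import numpy as np


def extract_my_feats(sig, fs=16000, frame_length_ms=25.0, frame_shift_ms=10.0):
    """Toy feature: log energy of each frame, returned with shape (numframes, 1)."""
    sig = np.asarray(sig, dtype=np.float64)
    window = int(fs * frame_length_ms / 1000)  # samples per frame
    shift = int(fs * frame_shift_ms / 1000)    # samples between frame starts
    if len(sig) < window:
        return np.zeros((0, 1), dtype=np.float32)
    numframes = 1 + (len(sig) - window) // shift  # only complete frames are kept
    feats = np.empty((numframes, 1), dtype=np.float32)
    for i in range(numframes):
        frame = sig[i * shift: i * shift + window]
        # the small constant avoids log(0) on silent frames
        feats[i, 0] = np.log(np.sum(frame ** 2) + 1e-10)
    return feats


# One second of a constant signal, just to show the output shape.
feats = extract_my_feats(np.ones(16000))
print(feats.shape)
```

With this toy extractor featdim would be 1. A real extractor returns one row per frame in the same way, with as many columns as your feature dimension.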

A sample python code is given below.

import kaldiio
import os
import numpy as np
import shutil
from scipy.io import wavfile
import scipy.signal as sp_sig

def copy_and_overwrite(from_path, to_path):
    # remove an existing copy first, since copytree refuses to overwrite
    if os.path.exists(to_path):
        shutil.rmtree(to_path)
    shutil.copytree(from_path, to_path)

basedir = os.getcwd()
setlist = ['test', 'train', 'dev']

featdim = 64 # insert your feature dim here
tags = 'myfeats'
splits = 10

fbankdatadir = os.path.join(basedir, 'data-fbank')
fbankfeatdir = os.path.join(basedir, 'fbank')

datadir = os.path.join(basedir, 'data-' + tags)
featdir = os.path.join(basedir, tags)

# create dirs if they dont exist
if not os.path.isdir(datadir):
    os.makedirs(datadir)
if not os.path.isdir(featdir):
    os.makedirs(featdir)

for setname in setlist:

    print('Processing ' + setname + ' data.')

    # copy files
    src_dir = os.path.join(fbankdatadir, setname)
    dst_dir = os.path.join(datadir, setname)
    copy_and_overwrite(src_dir, dst_dir)

    filelist = os.path.join(datadir, setname, 'wav.scp')
    dtag = setname

    filenames = []
    keys_orig = []
    with open(filelist) as f:
        for line in f:
            fields = line.split()
            filenames.append(fields[4]) # wav name is in the 5th column here; adjust for your wav.scp
            keys_orig.append(fields[0]) # key or id is in the first column
    numfiles = len(filenames)

    fileindex = 0

    for splitindex in range(splits):
        print('Processing ' + str(splitindex + 1) + ' of ' + str(splits) + ' split data.')

        srcark = 'raw_fbank_' + dtag + '.' + str(splitindex + 1) + '.ark'
        srcscp = 'raw_fbank_' + dtag + '.' + str(splitindex + 1) + '.scp'

        destark = 'raw_' + tags + '_' + dtag + '.' + str(splitindex + 1) + '.ark'
        destscp = 'raw_' + tags + '_' + dtag + '.' + str(splitindex + 1) + '.scp'

        srcark_filename = os.path.join(fbankfeatdir, srcark)
        destark_filename = os.path.join(featdir, destark)

        srcscp_filename = os.path.join(fbankfeatdir, srcscp)
        destscp_filename = os.path.join(featdir, destscp)

        d = kaldiio.load_ark(srcark_filename)
        write_dict = {} # kaldiio uses features in the form of a dict
        for key, array in d:
            # check if keys match
            if keys_orig[fileindex] != key:
                raise ValueError('ID Mismatch!')
            fname = filenames[fileindex]
            # read audio, assuming wav file
            # change this if the file is of some other type
            # (the reader call was missing here; scipy's wavfile.read is an assumption)
            fs, sig = wavfile.read(fname)

            # Do your feature extraction below (extract_my_feats is your own function):
            # expected shape is (numframes x featdim)
            newfeat = extract_my_feats(sig)
            # do the numframes check, if we are reusing existing alignments
            # else, comment out this check
            if array.shape[0] != newfeat.shape[0]:
                raise ValueError('Dimension mismatch!')

            # append the features into the dictionary
            write_dict[key] = newfeat.astype(np.float32) # let us store it as float32
            fileindex += 1

        # write features to disk
        print("Writing to " + destark_filename)
        kaldiio.save_ark(destark_filename, write_dict, scp=destscp_filename)

Notice that calling kaldiio.save_ark with the scp argument set to a destination name also generates the scp file along with the ark file. Next, we have to generate the feats.scp file of each subset by concatenating the scp files of its 10 splits. You can do it in a terminal using:

nj=10         # number of splits
feat=myfeats  # folder that holds the new ark and scp files
for x in test train dev; do
    rm -f data-myfeats/$x/feats.scp
    for n in $(seq $nj); do
      cat $feat/raw_myfeats_${x}.$n.scp || exit 1;
    done > data-myfeats/$x/feats.scp
done
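
If you would rather stay in python, the same concatenation can be sketched as below. The function name `concat_scp` is made up for illustration, and the file name pattern raw_<featname>_<setname>.<n>.scp is assumed to match the names written by the sample python code above.

```python
import os


def concat_scp(featdir, datadir, featname, setname, nj):
    """Concatenate the per-split scp files into <datadir>/<setname>/feats.scp."""
    out_path = os.path.join(datadir, setname, "feats.scp")
    with open(out_path, "w") as out:
        for n in range(1, nj + 1):
            part = os.path.join(
                featdir, "raw_%s_%s.%d.scp" % (featname, setname, n))
            # open() raises an error if a split is missing, like the
            # "|| exit 1" in the shell version
            with open(part) as src:
                out.write(src.read())
    return out_path


# Example call, assuming the directory layout used in this post:
# for setname in ["test", "train", "dev"]:
#     concat_scp("myfeats", "data-myfeats", "myfeats", setname, 10)
```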

Now you can change the data directory in the training scripts to data-myfeats to train and decode with kaldi using your custom features.

Deepak Baby
Applied Scientist

My research interests include speech recognition, enhancement and deep learning.
