NICE (Non-linear Independent Components Estimation): Insights and Implementation in Keras
The Keras implementation can be found here.
Flow-based deep generative models have not received as much attention in the research community as GANs or VAEs. This post discusses a flow-based model called NICE, its advantages over other generative models, and finally an implementation in Keras.
While VAEs use an encoder that finds only an approximation of the latent variable corresponding to a datapoint, GANs do not even have an encoder to infer latents. In flow-based models, the latent variables can be inferred exactly, without any approximation. Flow-based models use a reversible architecture (explained below), which enables exact inference and also allows optimization of the exact log-likelihood of the data instead of a lower bound on it.
Flow-based generative models
In flow-based approaches, the generative process for a datapoint $x$ is defined as

$$x = f^{-1}(h), \qquad h \sim p_H(h),$$

where $h$ is a latent variable drawn from a simple prior distribution $p_H$ (e.g., a standard Gaussian or logistic). The function $f$ is an invertible (bijective) transformation, typically built from neural networks, so the latent code $h = f(x)$ can be computed exactly for any datapoint.

Given an observed data variable $x$, the change-of-variables formula gives the exact log-likelihood:

$$\log p_X(x) = \log p_H(f(x)) + \log\left|\det\frac{\partial f(x)}{\partial x}\right|.$$

The determinant of the Jacobian matrix $\frac{\partial f(x)}{\partial x}$ accounts for the change of volume induced by the transformation $f$, and for a general high-dimensional transformation it is expensive to compute.

Thus, flow-based models require two important design choices on the transformation $f$:
- Have a reversible architecture
- Design transformations whose Jacobian determinant is easy to compute
To satisfy these two requirements, the trick is to choose transformations whose Jacobian is a triangular matrix, so that the determinant can be computed simply as the product of its diagonal elements. Thus,

$$\log\left|\det\frac{\partial f(x)}{\partial x}\right| = \sum_i \log\left|\frac{\partial f_i(x)}{\partial x_i}\right|.$$

These models are trained (i.e., the neural networks that parameterise $f$ are trained) by maximizing the exact log-likelihood $\log p_X(x)$ over the training data.

To generate data, we can sample $h \sim p_H(h)$ from the prior distribution and apply the inverse transformation $x = f^{-1}(h)$.
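To make the two-step recipe concrete, here is a minimal Python sketch of the training objective and the sampling procedure (a hedged illustration rather than the actual implementation: `f`, `f_inv`, `log_det_jacobian`, `log_prior`, and `prior_sampler` are placeholders for whatever flow and prior are chosen):

```python
def log_likelihood(x, f, log_det_jacobian, log_prior):
    """Exact log-likelihood via the change-of-variables formula:
    log p_X(x) = log p_H(f(x)) + log |det df/dx|."""
    h = f(x)                                    # exact inference: data -> latent
    return log_prior(h) + log_det_jacobian(x)   # both terms are cheap by design

def generate(f_inv, prior_sampler, n_samples):
    """Generation: sample from the prior, then invert the flow."""
    h = prior_sampler(n_samples)                # h ~ p_H(h)
    return f_inv(h)                             # x = f^{-1}(h)
```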
NICE: Non-linear Independent Components Estimation
As mentioned before, the two key requirements of flow-based approaches are an easy-to-compute Jacobian determinant and an easy inverse. In NICE, the input $x \in \mathbb{R}^D$ is split into two blocks $x_1$ and $x_2$, and the following additive coupling transformation is applied:

$$y_1 = x_1, \qquad y_2 = x_2 + m(x_1).$$

The inverse can be easily computed as:

$$x_1 = y_1, \qquad x_2 = y_2 - m(y_1).$$

The transformation $m$ can be any function, for example a deep neural network. The Jacobian of this coupling layer

$$\frac{\partial y}{\partial x} = \begin{bmatrix} I_d & 0 \\ \frac{\partial y_2}{\partial x_1} & I_{D-d} \end{bmatrix}$$

(where $d$ and $D-d$ are the dimensions of $x_1$ and $x_2$) is lower triangular with ones on the diagonal, resulting in a Jacobian matrix whose determinant is unity. Notice that such a design not only enables easy computation of the determinant, but also lets us choose an arbitrarily complex $m$, since we do not have to compute its derivative to obtain the determinant.

Similarly, the inverse operation from $y$ to $x$ only requires evaluating $m$ in the forward direction; the inverse of $m$ is never needed.
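A small NumPy sketch of a single additive coupling layer and a check that it inverts exactly (the 392/392 split and the one-layer stand-in for $m$ are illustrative assumptions; in the model, $m$ is a deep ReLU network):

```python
import numpy as np

def coupling_forward(x1, x2, m):
    """Additive coupling: y1 = x1, y2 = x2 + m(x1)."""
    return x1, x2 + m(x1)

def coupling_inverse(y1, y2, m):
    """The inverse only re-evaluates m in the forward direction: x2 = y2 - m(y1)."""
    return y1, y2 - m(y1)

# quick check that the two maps are exact inverses of each other
W = 0.01 * np.random.randn(392, 392)
m = lambda v: np.tanh(v @ W)                  # stand-in for an arbitrary neural net
x1, x2 = np.random.randn(5, 392), np.random.randn(5, 392)
y1, y2 = coupling_forward(x1, x2, m)
x1_rec, x2_rec = coupling_inverse(y1, y2, m)
assert np.allclose(x1, x1_rec) and np.allclose(x2, x2_rec)
```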
In the NICE model, since all the coupling transformations are volume preserving (unit Jacobian determinant), the resulting transformation gives equal weight to all dimensions, which is not desirable in practical applications. To address this, NICE also includes a diagonal scaling layer at the output that scales the $i$-th dimension by a trainable weight $S_{ii}$, i.e., $h_i = S_{ii}\, y_i$.

Thus the NICE criterion becomes maximizing the log-likelihood of the data distribution:

$$\log p_X(x) = \sum_i \left[\log p_{H_i}\!\left(f_i(x)\right) + \log |S_{ii}|\right].$$

Further, the NICE model assumes that the prior distribution is factorial,

$$p_H(h) = \prod_i p_{H_i}(h_i),$$

with each factor chosen to be a standard distribution:

- For a standard Gaussian: $\log p_{H_i}(h_i) = -\tfrac{1}{2}\left(h_i^2 + \log 2\pi\right)$
- For a standard logistic: $\log p_{H_i}(h_i) = -\log\!\left(1 + e^{h_i}\right) - \log\!\left(1 + e^{-h_i}\right)$
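These per-dimension log-densities are easy to check numerically; a small NumPy sketch for reference (not the Keras training loss used later):

```python
import numpy as np

def gaussian_log_likelihood(h):
    # sum_i log p_{H_i}(h_i) for a standard Gaussian prior; h has shape (batch, dim)
    return -0.5 * np.sum(h ** 2 + np.log(2 * np.pi), axis=1)

def logistic_log_likelihood(h):
    # sum_i log p_{H_i}(h_i) for a standard logistic prior;
    # np.logaddexp(0, h) computes log(1 + exp(h)) stably (i.e., softplus)
    return -np.sum(np.logaddexp(0, h) + np.logaddexp(0, -h), axis=1)
```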
We stack multiple additive coupling layers to obtain a more complex transformation. Since each coupling layer leaves one part of the data unchanged, we alternate the roles of the two parts in subsequent layers. Typically, 4 coupling layers are used so that all dimensions can influence one another. The scaling layer is parameterised exponentially, $S_{ii} = e^{s_i}$, which keeps the scale positive and makes its log-likelihood contribution simply $\sum_i s_i$.
In NICE, the forward model maps a datapoint to the latent space, $h = f(x)$, and is trained to minimize the negative log-likelihood with respect to the prior distribution. To generate data, we sample $h$ from the prior distribution and pass it through the inverse model to get $x = f^{-1}(h)$.
*Figure: the forward flow and the inverse flow of the NICE model.*
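Putting the pieces together, below is a hedged Keras sketch of the forward model. The layer widths, the MLP used for $m(\cdot)$, and the simple first-half/second-half split of the input are illustrative assumptions, not necessarily the exact architecture of the paper or of my implementation.

```python
from keras import backend as K
from keras.layers import Input, Dense, Lambda, Layer, add, concatenate
from keras.models import Model, Sequential

D = 784        # flattened MNIST
HIDDEN = 1000  # width of each coupling MLP (assumed value)

def coupling_mlp():
    """The free-form function m(.); its inverse or Jacobian is never needed."""
    return Sequential([Dense(HIDDEN, activation='relu', input_shape=(D // 2,)),
                       Dense(HIDDEN, activation='relu'),
                       Dense(D // 2)])

class Scaling(Layer):
    """Diagonal scaling layer: h_i = exp(s_i) * y_i with trainable s."""
    def build(self, input_shape):
        self.log_s = self.add_weight(name='log_s', shape=(1, input_shape[1]),
                                     initializer='zeros', trainable=True)
        super(Scaling, self).build(input_shape)

    def call(self, inputs):
        return inputs * K.exp(self.log_s)

x = Input(shape=(D,))
h1 = Lambda(lambda v: v[:, :D // 2])(x)   # first half of the dimensions
h2 = Lambda(lambda v: v[:, D // 2:])(x)   # second half

for i in range(4):                        # 4 coupling layers, alternating roles
    m = coupling_mlp()                    # each layer gets its own MLP
    if i % 2 == 0:
        h2 = add([h2, m(h1)])             # y2 = x2 + m(x1), x1 passes through
    else:
        h1 = add([h1, m(h2)])             # roles swapped in the next layer

h = Scaling()(concatenate([h1, h2]))      # top-level diag(exp(s)) scaling
forward_model = Model(x, h)
```

The inverse model reuses the same (tied) coupling MLPs and the same `log_s` weights, but divides by $e^{s}$ and subtracts $m(\cdot)$ in the reverse order.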
Implementation Notes
As mentioned above, the NICE model is trained to minimize the negative log-likelihood with respect to a standard Gaussian or logistic distribution.
Getting the models correct: The main design challenge was to get the inverse model correct, with its weights tied to the corresponding forward model. We can verify that the inverse model is correct even before training: take a test image from MNIST and pass it through the forward model, then use the resulting output as input to the inverse model. If the inverse model is wired up with the correct weights, it should reproduce the test image.
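A minimal version of this sanity check, assuming `forward_model` and `inverse_model` are the weight-tied Keras models and `x_test` is the preprocessed MNIST test set:

```python
import numpy as np

x = x_test[:1].reshape(1, 784)         # a single test image, flattened
h = forward_model.predict(x)           # data -> latent
x_rec = inverse_model.predict(h)       # latent -> data
print(np.max(np.abs(x - x_rec)))       # should be ~0, up to floating-point error
```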
Getting NaNs: For implementing the logistic loss, $\sum_i \left[\log(1 + e^{z_i}) + \log(1 + e^{-z_i})\right]$ computed on the latent code `z = f(x)`, I initially used the Keras backend as:

```python
import keras.backend as K

logistic_negloglikelihood = K.sum(K.log(1 + K.exp(z)) + K.log(1 + K.exp(-z)), axis=1)
```
However, this was resulting in NaNs after a few epochs, most likely because `K.exp(z)` overflows for large `z`. It was then replaced with the `softplus` function, which computes $\log(1 + e^{z})$ in a numerically stable way:

```python
logistic_negloglikelihood = K.sum(K.softplus(z) + K.softplus(-z), axis=1)
```
Initialization: Initially, I was using the default `glorot_uniform` initialization of Keras, which was not reproducing the log-likelihoods reported in the paper. Dinh's original Theano implementation initializes the weights from a uniform distribution; using the same initialization for the Dense layers resulted in better log-likelihoods.

Batch-size: Dinh used the Adam optimizer with a batch size of 256. In my experiments, I observed that increasing the batch size consistently yielded better log-likelihoods.

Clip the outputs when generating data: The MNIST data is rescaled to a fixed range for training, but during generation the output of the inverse model may fall outside that range. Therefore, we clip the generated outputs to bring them back to the desired range.
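For completeness, a short sketch of the generation step with clipping (the logistic prior, the model name, and the $[0, 1]$ range are assumptions matching a typical MNIST setup):

```python
import numpy as np

n_samples = 16
h = np.random.logistic(size=(n_samples, 784))   # sample from the (logistic) prior
x_gen = inverse_model.predict(h)                # latent -> data via the inverse flow
x_gen = np.clip(x_gen, 0.0, 1.0)                # clip back to the assumed [0, 1] training range
```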