Syntax-Directed Variational Autoencoder for Structured Data

Advances in deep representation learning have resulted in powerful generative approaches for modeling continuous data such as time series and images, but it is still challenging to correctly deal with discrete structured data, such as chemical molecules and computer programs. To tackle these challenges, there have been many improvements in the formalization of structure generation that help deep models better capture complex structures under the constraint of limited data. Improving on previous works that formalize data as arbitrary strings (CVAE) or as strings generated from a context-free grammar (GVAE), we propose a new formalization framework that allows both the syntactic and the semantic constraints of structured data to guide the data modeling. Our approach facilitates a generative model that models data much more faithfully, under both syntax and semantics constraints.
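To make the gap between syntax and semantics concrete, consider SMILES strings for molecules: a string can be derivable from the SMILES context-free grammar and still describe no valid molecule, because constraints such as ring-bond matching are semantic rather than syntactic. The toy check below (illustrative only, not from our codebase; it assumes RDKit is installed) shows one such case:

```python
# Toy illustration: syntactic validity under the SMILES grammar does not
# imply semantic validity. Requires RDKit (pip install rdkit).
from rdkit import Chem

# 'C1CCCCC1' opens ring bond 1 and closes it: cyclohexane, a valid molecule.
# 'C1CCCCC2' is still a well-formed token sequence under the grammar, but its
# ring bonds 1 and 2 are never closed, so it denotes no molecule.
for smiles in ["C1CCCCC1", "C1CCCCC2"]:
    mol = Chem.MolFromSmiles(smiles)
    print(smiles, "->", "valid" if mol is not None else "semantically invalid")
```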

In our recent paper Syntax-Directed Variational Autoencoder for Structured Data, accepted at ICLR 2018, we detail our new formalization, modeling, and experiments. Simply put, our proposed formalization, SD-VAE, reshapes the output space of the generative model to fit the meaningful data space more tightly, as shown in Figure 1, by systematically converting the offline semantic check into online guidance for generation.

song_iclr18_fig1
Figure 1. The hierarchy of the structured-data decoding spaces of different works, and the theoretical classification of the corresponding strings in formal language theory.

Even with the increased capacity, our approach has the same computational cost as previous models, enabling efficient learning and inference. Finally, it delivers significant and consistent improvements across various tasks on chemical molecules and computer programs, demonstrating its empirical value in solving real-world problems.

Our method converts the offline attribute grammar, as shown in Figure 2, into an on-the-fly generative process. In an attribute grammar, there are inherited and synthesized attributes that are computed offline to ensure semantic correctness. To enable on-the-fly computation during tree generation, we introduce stochastic lazy attributes, which transform the corresponding synthesized attributes into inherited constraints in the generative procedure; a lazy linking mechanism then sets the actual value of each attribute once all the other attributes it depends on are ready, as illustrated in Figure 3.
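As a toy illustration of this idea (hypothetical code, much simpler than the actual SD-VAE decoder, which operates on parse trees of the full grammar): suppose a grammar requires a ring to open and close with the same digit. Checking the two digits after generation is the offline synthesized-attribute view; sampling the digit up front and forcing both occurrences to use it is the stochastic lazy attribute with lazy linking:

```python
# Toy sketch of converting an offline semantic check into online guidance
# (hypothetical code, not the paper's implementation).
import random

def sample_ring():
    # Stochastic lazy attribute: sample the ring digit *before* expanding the
    # subtree, turning a synthesized attribute into an inherited constraint.
    digit = str(random.randint(1, 9))

    def chain():
        # CHAIN -> 'C' CHAIN | 'C'
        return "C" + (chain() if random.random() < 0.7 else "")

    # Lazy linking: both ring-bond positions are bound to the same sampled
    # digit, so the output is semantically valid by construction.
    return "C" + digit + chain() + "C" + digit

print(sample_ring())   # e.g. 'C3CCC3' -- the ring bonds always match
```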

song_iclr18_fig2
Figure 2. The offline syntax and semantics checks that our method converts into on-the-fly guidance.
song_iclr18_fig3
Figure 3. The on-the-fly generative process of our proposed method, from (a) to (g). Please refer to the paper for more details.

Our formalization works with any generative model, but in practice we pair it with a Variational Autoencoder (VAE), an explicit generative model that converts between a discrete data space and a continuous latent space, for a fair comparison with previous models. Empirical results show that our formalization boosts performance in reconstruction accuracy and valid prior accuracy, both important quality metrics for a VAE, as shown in Figure 4.
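For readers unfamiliar with VAEs, the sketch below shows a generic skeleton (plain PyTorch, not our SD-VAE architecture): an encoder maps the data to a Gaussian in latent space, the reparameterization trick allows backpropagation through the sampling step, and a decoder maps latent points back to the data space. SD-VAE keeps this skeleton but replaces unconstrained decoding with the syntax- and semantics-guided generation described above.

```python
# Generic VAE skeleton (illustrative; dimensions and layers are arbitrary).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, input_dim, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, input_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(logits, target, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    recon = F.binary_cross_entropy_with_logits(logits, target, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```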

song_iclr18_fig4
Figure 4. Reconstruction accuracy and valid prior accuracy, two important metrics measuring how well a VAE models the data. Higher numbers indicate better results; the bolded entries are the best results, which come from our method.

Following GVAE, we use Bayesian optimization to find new molecules with desired drug-related properties. Our method finds molecules with much better drug-likeness than previous models, as shown in Figure 5, and similarly finds computer programs that are close to a particular ground truth, as in Figure 6.
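Concretely, the optimization loop looks roughly like the sketch below (a generic stand-in following the GVAE protocol, not our exact setup; `decode_and_score` is a hypothetical function that decodes a latent point into a molecule or program and returns its property score):

```python
# Hedged sketch of Bayesian optimization in the VAE latent space: fit a GP
# surrogate to scored points and pick the next point by expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(gp, candidates, y_best):
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    gamma = (mu - y_best) / sigma                  # we maximize the score
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def optimize_latent(decode_and_score, dim, n_init=20, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_init, dim))             # initial draws from the prior
    y = np.array([decode_and_score(x) for x in X])
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        cand = rng.normal(size=(1000, dim))        # candidate latent points
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
        X = np.vstack([X, x_next])
        y = np.append(y, decode_and_score(x_next))
    return X[np.argmax(y)]                         # best latent point found
```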

song_iclr18_fig5
Figure 5. The top-3 molecules and their corresponding scores found by each method using Bayesian optimization. Higher numbers indicate better drug-likeness; the molecules with red numbers are the best ones, which come from our method.
song_iclr18_fig6
Figure 6. On the left are the best programs found by each method using Bayesian optimization. On the right are the top-3 closest programs found by each method, along with their distance to the ground truth. Lower distance is better, and our method finds the symbolic program that, when used as a function, is closest to the ground truth.

A few other experiments also suggest that our approach models the data such that similar data points are placed near each other in latent space (Figure 7), and that it generates diverse outputs, both desired properties of a good generative model.

song_iclr18_fig7
Figure 7. Visualization of the latent space learned by our method. In this smooth visualization, nearby points in the latent space are similar to each other, which indicates that our method discovers a chemically meaningful latent representation of molecules by only looking at molecules, without being taught any domain knowledge in chemistry (i.e., without "supervision", in machine learning terms).
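A common way to probe such smoothness (sketched below with hypothetical `encode`/`decode` wrappers around a trained model, not our released code) is to walk linearly between the latent codes of two molecules and decode along the path; in a smooth latent space, the decoded molecules change gradually:

```python
# Hedged sketch: linear interpolation between two latent codes.
import numpy as np

def interpolate(encode, decode, smiles_a, smiles_b, steps=7):
    z_a, z_b = encode(smiles_a), encode(smiles_b)
    for t in np.linspace(0.0, 1.0, steps):
        print(decode((1 - t) * z_a + t * z_b))  # neighbors should look similar
```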

While our work is the first to tackle the challenge of addressing both syntax and semantics with prior knowledge, and shows strong empirical improvements, there are still more problems to tackle. For example, we would like to explore the refinement of the formalization on more theoretical grounds, and to investigate its application to a more diverse set of data modalities.

[Post by: Hanjun Dai*, Yingtao Tian*, Bo Dai, Steven Skiena, Le Song]

 
