
Accurate estimation of meal intake is important for diet monitoring and diabetes management, but calorie prediction from a single data modality can be limited. Models based on continuous glucose monitor (CGM) data capture post-prandial glucose responses, while food image models capture visual information about meal content; each modality alone, however, may miss important nutritional information. This paper proposes a multimodal inverse metabolic model that combines CGM data and food photographs for calorie estimation. The model extracts glucose representations using an attention-based Transformer and Gaussian area-under-the-curve (gAUC) features, and extracts food image representations using a Vision Transformer. These modality-specific embeddings are combined by a late-fusion projector network to generate calorie predictions. The authors evaluate the model on data from 27 participants who wore FreeStyle Libre Pro CGMs and consumed breakfasts and lunches with known caloric content. The joint CGM-image embedding achieved the best performance, with a normalized root mean squared error (NRMSE) of 0.34 and a correlation of 0.52, outperforming the CGM-only and image-only baselines. The results suggest that combining multiple views of a meal can improve automated diet monitoring technologies.
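
The late-fusion step and the two reported metrics can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the embedding sizes, the two-layer ReLU projector, the random weights, and the function names (`late_fusion_predict`, `nrmse`, `pearson_r`) are all assumptions made for illustration. The abstract also does not say how the RMSE is normalized or which correlation coefficient is reported; the sketch assumes normalization by the mean of the true values and Pearson correlation.

```python
import numpy as np


def late_fusion_predict(cgm_emb, img_emb, w_hidden, b_hidden, w_out, b_out):
    """Concatenate per-meal embeddings and map them to one calorie value each.

    Hypothetical projector: one ReLU hidden layer followed by a linear output.
    """
    joint = np.concatenate([cgm_emb, img_emb], axis=-1)    # (n, d_cgm + d_img)
    hidden = np.maximum(joint @ w_hidden + b_hidden, 0.0)  # ReLU activation
    return (hidden @ w_out + b_out).squeeze(-1)            # (n,)


def nrmse(y_true, y_pred):
    """RMSE divided by the mean of the true values (assumed normalization)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted values."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])


# Toy inputs standing in for the Transformer/gAUC and Vision Transformer outputs.
rng = np.random.default_rng(0)
n, d_cgm, d_img, d_hidden = 8, 16, 32, 24  # illustrative sizes only
cgm_emb = rng.normal(size=(n, d_cgm))
img_emb = rng.normal(size=(n, d_img))

# Untrained random weights; a real projector would be fit to calorie labels.
w_hidden = rng.normal(scale=0.1, size=(d_cgm + d_img, d_hidden))
b_hidden = np.zeros(d_hidden)
w_out = rng.normal(scale=0.1, size=(d_hidden, 1))
b_out = np.array([500.0])

calories = late_fusion_predict(cgm_emb, img_emb, w_hidden, b_hidden, w_out, b_out)
```

In this arrangement each modality is encoded independently and only the resulting embeddings are joined, so a CGM-only or image-only baseline corresponds to feeding the projector a single embedding instead of the concatenation.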