Parsimonious modelling of spectroscopy data via a Bayesian latent variables approach
Infrared spectroscopy techniques represent a convenient and non-disruptive way to rapidly collect large amounts of data. Nowadays, these data are used effectively in several fields, such as medicine, astronomy and food science. Nonetheless, from a statistical perspective, they pose relevant challenges, mainly due to their high dimensionality and to the peculiar relationships among spectral variables (wavelengths), which often arise from convoluted chemical processes. In this scenario, factor analysis represents a sensible strategy, as it aims to produce parsimonious representations of the data while focusing on the correlation structure. However, feature redundancy, a troublesome issue when dealing with spectral data, has been largely overlooked. Therefore, a modification of factor analysis is proposed, which maps the data into a lower-dimensional latent space while simultaneously clustering the variables. A flexible Bayesian estimation procedure is then considered to fit the model. On the one hand, this approach results in an even more parsimonious summary of the data, highlighting which wavelengths carry similar information. On the other hand, the obtained partition of the wavelengths provides useful insights from a chemical standpoint. The method is applied to mid-infrared spectroscopy data of milk from cows on different feeding regimens, providing a useful tool to guarantee milk authenticity.