by Daniela Vergara, Reggie Gaudino, Thomas Blank, Brian Keegan
The widespread legalization of Cannabis has opened the industry to using contemporary analytical techniques for chemotype analysis. Chemotypic data has been collected on a large variety of oil profiles inherent to the cultivars that are commercially available. The unknown gene regulation and pharmacokinetics of dozens of cannabinoids offer opportunities of high interest in pharmacology research. Retailers in many medical and recreational jurisdictions are typically required to report chemical concentrations of at least some cannabinoids. Commercial cannabis laboratories have collected large chemotype datasets of diverse Cannabis cultivars. In this work a data set of 17,600 cultivars tested by Steep Hill Inc., is examined using machine learning techniques to interpolate missing chemotype observations and cluster cultivars into groups based on chemotype similarity. The results indicate cultivars cluster based on their chemotypes, and that some imputation methods work better than others at grouping these cultivars based on chemotypic identity. Due to the missing data and to the low signal to noise ratio for some less common cannabinoids, their behavior could not be accurately predicted. These findings have implications for characterizing complex interactions in cannabinoid biosynthesis and improving phenotypical classification of Cannabis cultivars.