Abstract
Compound biological fingerprints built on data from high-throughput screening (HTS) campaigns, or HTS fingerprints, are a novel cheminformatics method of representing compounds that integrates chemical and biological activity data. The method is gaining momentum in drug discovery applications, including hit expansion, target identification, and virtual screening. HTS fingerprints have two major limitations, noise and missing data, which are intrinsic to high-throughput data acquisition technologies and to the assay availability or assay selection procedure used to construct them. In this work, we present a methodology for defining an optimal set of HTS fingerprints using a desirability function that encodes the principles of maximum biological and chemical space coverage and minimum redundancy between HTS assays. We used a genetic algorithm to optimize the desirability function and obtained an optimal fingerprint, whose performance was evaluated on a test set of 33 diverse assays. Our results show that the optimal HTS fingerprint represents compounds in chemical biology space using 25% fewer assays than the full fingerprint. When used for virtual screening, the optimal HTS fingerprint achieved performance equivalent to that of full fingerprints, in terms of both area under the curve and enrichment factors, for 27 out of 33 test assays, whereas randomly assembled fingerprints achieved equivalent performance in only 23 test assays.
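For readers unfamiliar with the two virtual screening metrics named above, the following minimal Python sketch shows how an enrichment factor and an area under the ROC curve are commonly computed from a ranked compound list. It illustrates the standard definitions only and is not the evaluation code used in this work; the function names, the `fraction` cutoff, and the toy scores and labels are assumptions introduced for illustration.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor in the top `fraction` of a score-ranked list.

    EF = (hit rate among the top-ranked compounds) / (hit rate overall).
    `labels` holds 1 for active and 0 for inactive compounds (assumed encoding).
    """
    n_total = len(labels)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Always inspect at least one compound, even for very small fractions.
    n_top = max(1, math.ceil(fraction * n_total))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


def roc_auc(scores, labels):
    """Area under the ROC curve as the probability that a randomly chosen
    active outscores a randomly chosen inactive (ties count as one half)."""
    actives = [s for s, l in zip(scores, labels) if l]
    inactives = [s for s, l in zip(scores, labels) if not l]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = sum(
        1.0 if a > i else 0.5 if a == i else 0.0
        for a in actives
        for i in inactives
    )
    return wins / (len(actives) * len(inactives))


# Toy example (hypothetical data): 10 compounds, 2 actives.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
perfect = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
partial = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(enrichment_factor(scores, perfect, fraction=0.2))  # 5.0
print(roc_auc(scores, perfect))                          # 1.0
print(enrichment_factor(scores, partial, fraction=0.2))  # 2.5
print(roc_auc(scores, partial))                          # 0.9375
```

The pairwise AUC formulation is quadratic in the number of compounds and is chosen here for clarity; for HTS-scale data a rank-based implementation would be more appropriate.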