P02-22 CBI2025

Implementation of a Chemical Structure Database System Bridging Open Science and Drug Discovery

Seiji MATSUOKA*¹, Akiko IDEI¹, Minoru YOSHIDA¹,²,³

¹CSRS Drug Discovery Seeds Development Unit, RIKEN
²Office of University Professors, The University of Tokyo
³Collaborative Research Institute for Innovative Microbiology, The University of Tokyo

[Background]
In academic drug discovery, which relies on collaboration among diverse partners, managing chemical structure information remains a major challenge. Confidentiality management is required to support intellectual property strategies, yet social demand for open science, such as the creation of new value through data sharing, is rapidly increasing. Introducing and maintaining either an in-house system or a commercial database package imposes high costs on academic institutions and makes it difficult to adapt flexibly to new modalities and laboratory automation. To address these issues, we designed and implemented a lightweight database system using open-source software.

[Implementation]
Chemical structures, metadata, and calculated parameters are stored in a PostgreSQL database and served through a Web API automatically generated by PostgREST[1]. End users can extract data and perform structure searches via Web applications built with Streamlit[2], and can export the search results in the desired formats. For data input, we established an automated deployment process based on an ETL (Extract/Transform/Load) architecture, with preprocessing workflows that convert heterogeneous data files into formats suitable for the database. Access control based on user attributes is achieved by combining Keycloak[3] for single sign-on (SSO) with PostgreSQL Row Level Security (RLS). These systems are deployed within HOKUSAI SailingShip (HSS)[4], a data science infrastructure provided by RIKEN.
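
As an illustration of how a client might query such an API, the following minimal sketch builds (without sending) a PostgREST-style request using PostgREST's `select`, `order`, `limit`, and `column=operator.value` filter parameters. The base URL, the `compounds` table, and its column names are hypothetical and are not taken from the system described here. It is also an assumption that the access token issued at sign-on is passed as a bearer token, so that row-level policies can be evaluated for the requesting user.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: the base URL, table name, and column names are
# illustrative assumptions, not details of the system in this abstract.
API_BASE = "https://example.org/api"


def build_compound_query(token, max_mol_weight, limit=10):
    """Build (but do not send) a PostgREST GET request for a `compounds` table."""
    params = {
        # Vertical filtering: return only these columns.
        "select": "compound_id,smiles,mol_weight",
        # Horizontal filtering: `lt.` means "less than".
        "mol_weight": f"lt.{max_mol_weight}",
        "order": "mol_weight.asc",
        "limit": str(limit),
    }
    # Keep commas unescaped so the select list stays readable.
    url = f"{API_BASE}/compounds?{urlencode(params, safe=',')}"
    headers = {
        # Assumed: the SSO access token is forwarded as a bearer token.
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return Request(url, headers=headers, method="GET")


request = build_compound_query(token="<access-token>", max_mol_weight=500)
print(request.full_url)
```

Because the request is only constructed, the sketch runs without a server; passing the returned object to `urllib.request.urlopen` would send it.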

[Results and Discussion]
The entire system is defined in a Docker Compose file, which simplifies development, maintenance, and migration across cloud environments. The ETL architecture reduces the operational burden of data updates, ensures reproducibility of preprocessing workflows, and enhances traceability to the original data. Because SSO is shared with other cloud-based systems (e.g., assay databases, data analysis applications), it not only improves user convenience but also enables fine-grained access control, supporting the secure integration of confidential data.
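
One way a transform step could support the traceability described above is to attach the source file name and a checksum of the raw file to every loaded row. The sketch below assumes CSV input and uses hypothetical column aliases and output field names (`compound_id`, `smiles`, `source_file`, `source_sha256`); the actual file formats and schema are not specified here.

```python
import csv
import hashlib
import io

# Hypothetical header aliases: different source files are assumed to name
# the same columns differently. The real schema is not described here.
COLUMN_ALIASES = {
    "compound_id": {"compound_id", "id", "compound id"},
    "smiles": {"smiles", "structure"},
}


def normalize_header(name):
    """Map a raw column header to a target column name, or None if unknown."""
    key = name.strip().lower()
    for target, aliases in COLUMN_ALIASES.items():
        if key in aliases:
            return target
    return None


def transform(raw_bytes, source_name):
    """Convert raw CSV bytes into normalized rows tagged with their origin."""
    # The checksum identifies the exact input file each row came from.
    checksum = hashlib.sha256(raw_bytes).hexdigest()
    reader = csv.DictReader(io.StringIO(raw_bytes.decode("utf-8")))
    rows = []
    for record in reader:
        row = {}
        for header, value in record.items():
            if header is None:
                continue  # surplus fields beyond the header row
            target = normalize_header(header)
            if target is not None:  # unknown columns are dropped in this sketch
                row[target] = (value or "").strip()
        row["source_file"] = source_name
        row["source_sha256"] = checksum
        rows.append(row)
    return rows


sample = b"ID,Structure,Note\nC-1, CCO ,example\n"
for row in transform(sample, "batch_a.csv"):
    print(row)
```

Since the function depends only on the input bytes, rerunning it on the same file yields the same rows and checksum, which is the property that makes a preprocessing step reproducible.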

[Conclusion]
Confidentiality management has long been a barrier to data sharing in drug discovery. A lightweight information management system that balances confidentiality and accessibility expands the strategic options for academic data management and is expected to promote broader data utilization in the context of open science.

[References]
[1] PostgREST https://github.com/PostgREST
[2] Streamlit https://streamlit.io/
[3] Keycloak https://www.keycloak.org/
[4] Data Science Infrastructure HOKUSAI SailingShip https://i.riken.jp/data-sci/