With a little help from our friends: the eScience - ARISE Hackathon
Once ARISE is running, it will be easy to ask it things like “Show me a list of all species of butterflies for which we have barcodes from Dutch specimens. And all those barcodes, please.” But how do we get there?
The ARISE Biocloud team is working hard on the system that will manage all the data that we want to handle and make accessible through ARISE. Given all the different data types you can collect to monitor biodiversity, such as images, sounds, radar, DNA data and algorithms, we have realised we need a bit more than traditional data warehouse software. Over the past months, the development team has worked on a data lakehouse prototype using Delta Lake, an open-source framework designed for handling big data.
Many hands make light work
Our first data lakehouse prototype is designed for managing image data from Diopsis insect cameras. It was about time to test whether this prototype can also be applied to a different type of data. We received a very generous offer from the eScience Center, which provided five software developers to join the ARISE team for a week. This was an excellent opportunity to share our knowledge about the data lakehouse concept and the current prototype, and at the same time use the extra resources to build a prototype for managing DNA data. Together with developers from the Naturalis Application Development and Infra teams, we had a great group of people working together.
From user stories to data pipelines
“Show me a list of all species of butterflies for which we have barcodes from specimens collected in the Netherlands.” This could be a search that a future ARISE user wants to perform, resulting in a list of species, but maybe also in the option to download a file that includes all those barcodes. Such a file can then be used to match a barcode of an unidentified specimen for species identification. To be able to show all the relevant results we need a data system that connects information about the specimen with the sequence data and lab process data. For the hackathon we used a simple data model, but we will add other types of information, such as project metadata, in the future as well. Once you have all the data managed correctly, you can connect it to a search system and enable a user to download the files.
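To make the user story concrete, here is a minimal sketch in Python of how specimen records and barcode sequences could be linked to answer that query. It is an illustration only, not the hackathon's data model or code: the record types, field names, sample records and placeholder sequences are all assumptions. Lab process data could be linked through the same specimen identifier.

```python
from dataclasses import dataclass


# Hypothetical record types; the real ARISE data model is not shown in this post.
@dataclass(frozen=True)
class Specimen:
    specimen_id: str
    species: str
    group: str    # informal group label, e.g. "butterflies"
    country: str  # country where the specimen was collected


@dataclass(frozen=True)
class Barcode:
    barcode_id: str
    specimen_id: str  # link back to the specimen record
    sequence: str     # placeholder string, not a real barcode sequence


# Made-up example records, for illustration only.
SPECIMENS = [
    Specimen("S1", "Vanessa atalanta", "butterflies", "Netherlands"),
    Specimen("S2", "Pieris rapae", "butterflies", "Netherlands"),
    Specimen("S3", "Pieris rapae", "butterflies", "Belgium"),
    Specimen("S4", "Bombus terrestris", "bees", "Netherlands"),
]

BARCODES = [
    Barcode("B1", "S1", "ACGTACGT"),
    Barcode("B2", "S2", "TTGACCAA"),
    Barcode("B3", "S3", "TTGACCAT"),
    Barcode("B4", "S4", "GGCATTAC"),
]


def find_barcodes(specimens, barcodes, group, country):
    """Return (specimen, barcode) pairs for specimens matching group and country."""
    wanted = {
        s.specimen_id: s
        for s in specimens
        if s.group == group and s.country == country
    }
    return [(wanted[b.specimen_id], b) for b in barcodes if b.specimen_id in wanted]


def species_list(matches):
    """Sorted list of unique species names among the matches."""
    return sorted({specimen.species for specimen, _ in matches})


def to_fasta(matches):
    """Render the matched barcodes as FASTA text that could be offered for download."""
    lines = []
    for specimen, barcode in matches:
        lines.append(f">{barcode.barcode_id}|{specimen.species}|{specimen.country}")
        lines.append(barcode.sequence)
    return "\n".join(lines) + "\n"


matches = find_barcodes(SPECIMENS, BARCODES, "butterflies", "Netherlands")
print(species_list(matches))
print(to_fasta(matches), end="")
```

In a lakehouse setting the same join would run over managed tables rather than in-memory lists, but the shape of the query stays the same: filter the specimen metadata, follow the link to the sequence data, then return both a species list and a downloadable file.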
A fruitful week
Throughout the week the developers built and compared multiple ways to store sequence data, search through the metadata and return the results of a search. We gained a lot of insight by testing solutions and figuring out what works and what does not work as well as we thought. We may not have ended the week with a finished prototype, but we now understand the needs for storing and managing DNA data much better. This work will be continued by ARISE and Naturalis developers.
We would like to say a big thank you to the eScience Center for providing their support. It was great to work together with such an enthusiastic group of developers, share ideas and knowledge and generate a prototype that gives us a good foundation for further work on making genetic data better available.