
Data Pipelines for Science Spring School

How can researchers design and implement data pipelines for scientific research? Our Data Pipelines for Science School will help scientists learn how to prepare their datasets correctly, efficiently and robustly for machine learning in their scientific projects.

The Data Pipelines for Science School was originally launched in Winter 2022. Encouraged by positive feedback and increased demand, we are pleased to announce that Accelerate Science will run the school again as the Spring School 2023.

Well-curated and managed data is central to the effective use of AI, in science and elsewhere. How can scientists build the data pipelines they need to accelerate their research with AI?

Machine learning is an important tool for researchers across disciplines. Scientists today have access to more data, from a greater range of sources and at greater speed than ever before, and opportunities to extract insights from this data using AI. But before deploying AI, researchers must have a data pipeline that transforms their data into a state that is suitable for the machine learning algorithms being used.

These pipelines are important independent research outputs, as they enable others to easily inspect, reproduce, refine or extend the scientist’s work. However, implementing data pipelines presents numerous software challenges that can be difficult to resolve, or even identify, for scientists who do not have significant expertise in software engineering concepts and practices.

Such challenges include: how do I ensure the correctness of my pipeline? How do I structure my pipeline in a way that makes it easier for others to reuse and extend? How do I ensure my pipeline is robust enough to deal with different types and volumes of data? How do I document and publish my pipeline?
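As a rough illustration of the first two questions, the sketch below structures a pipeline as a list of small, independently testable steps. It is not taken from the course materials: the record layout, the function names (`drop_missing`, `to_kelvin`, `run_pipeline`) and the example data are all assumptions made for illustration.

```python
from functools import reduce

# Hypothetical example: each record is a dict with a "site" name and a
# "temp_c" reading that may be missing (None).

def drop_missing(records):
    """Remove records that have no temperature reading."""
    return [r for r in records if r.get("temp_c") is not None]

def to_kelvin(records):
    """Return new records with an added "temp_k" field; inputs are not mutated."""
    return [{**r, "temp_k": r["temp_c"] + 273.15} for r in records]

def run_pipeline(records, steps):
    """Apply each step in order, passing the output of one step to the next."""
    return reduce(lambda data, step: step(data), steps, records)

raw = [
    {"site": "A", "temp_c": 20.0},
    {"site": "B", "temp_c": None},
]
clean = run_pipeline(raw, [drop_missing, to_kelvin])

# A simple correctness check: one record survives and the conversion is right.
assert len(clean) == 1
assert abs(clean[0]["temp_k"] - 293.15) < 1e-9
```

Because each step is a plain function from a list of records to a new list of records, a step can be tested on its own, and a collaborator can extend the pipeline by adding another function to the list.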

Accelerate Science’s Data Pipelines for Science School helps scientists overcome such data pipeline challenges by equipping them with the latest best-practice software techniques. It consists of a blend of lectures and labs: the lectures focus on general principles and case studies, and the labs focus on hands-on exercises in Python. Participants also have the opportunity to discuss and share data pipeline issues encountered in their own research with the course instructor and cohort, and to relate them to the course content.

FAQs

Who is this course for?

This course is for any researcher at the University of Cambridge who works with large datasets and is interested in making the transition to data science-led research, but does not have significant expertise in software or data engineering.

How will this course help my research?

Any data-intensive research requires implementing data pipelines, even if only implicitly. As mentioned in the course overview, this course will help researchers identify and resolve the design and implementation challenges that arise when using data pipelines in research. Applying the methodology taught in this course will help ensure that your data pipelines and data handling run efficiently and correctly, and can be shared frictionlessly with collaborators, all of which benefit any scientific research programme.

What is the minimum level of programming experience required to participate in the course?

There are no minimum programming skill requirements to enrol on the course. However, we ask participants to complete an online Python course before the school starts, to ensure that all enrolled students have the programming level needed for the course. For further details on how to access the free online course, visit: https://acceleratescience.github.io/resources/python-programming-for-science.html

Do you cover any other programming language apart from Python?

All the labs will be conducted in Python; however, the pipeline concepts discussed can easily be applied to any other language.

Do I need to be familiar with machine learning to participate?

While implementing a data pipeline is an important preparatory step for using machine learning, the focus of this course will be solely on data pipelines and thus you do not need to be familiar with machine learning to participate.

What topics will the course cover?

Day 1 Wednesday 29th March

10:15 - 10:30 Registration

10:30 - 11:15 Lecture 1 “Introduction to data pipelines”

11:35 - 13:00 Lab 1 “Automating pipelines in Python”

14:00 - 14:45 Lecture 2 “Testing and profiling data pipelines”

15:05 - 16:30 Lab 2 “Examples of testing and profiling data pipelines in Python”

Day 2 Thursday 30th March

10:30 - 11:15 Lecture 3 “Publishing Data Pipelines & Course Summary”

11:35 - 13:00 Lab 3 “Publishing workflow”

13:00 - 14:30 Pizza lunch & certificate collection

14:30 - 16:00 Open office hour with Accelerate Science team

How do I apply?

To apply, candidates should complete this form by 5pm on Friday 10th of March. Candidates who are accepted will be notified by Tuesday 14th of March.

Please note that spaces are limited. We prioritise candidates who express in the application form an immediate and pressing research need for learning the topics presented in the School, while also trying to maintain a balanced cohort in terms of academic background, research interests and career levels. For further information, please contact accelerate-science@cst.cam.ac.uk

How much does the Spring School cost?

The Spring School is available free of charge for Cambridge researchers.

Where does the Spring School take place?

The Spring School will take place in the Department of Computer Science and Technology at the William Gates Building.
