Internet Archive is a non-profit building a free library of all of the published works of humanity to share with the world. We're not there yet, but we've managed to accumulate some data along the way. Can you help us engineer it?

The Archiving and Data Services department provides services to mission-aligned organizations (primarily other libraries and cultural heritage institutions). These services include web crawling SaaS, managed large-scale crawls, long-term digital preservation, and, particularly relevant for this role, making use of these web archives and digital collections.

We're looking for a Data Engineer to help us with some of the following:

- Turn researcher Jupyter notebooks into robust systems (these notebooks are mostly in Scala)
- Develop data munging/wrangling/deriving workflows (we use Spark and Temporal.io)
- Help administer a 7.5 petabyte Hadoop cluster
- Potentially write jobs for our main, in-house long-term storage cluster
- There are always APIs that need work (these are mostly in Python)
- ML experience is an interesting bonus

We're fully remote; employees can be based anywhere in the US or Canada.

This is a new opening as of Dec 1, so new that we're still working on getting it posted. If interested, please reach out to Alex at avdempsey [at] archive [dot] org.

