Data Acquisition Engineer
This is a full-time remote position focused on developing systems for large-scale web crawling and data acquisition to support the training of frontier models for software development.
Key Responsibilities
Design and operate a large-scale web crawler for acquiring publicly accessible data
Develop specialized crawlers targeting high-value sources to enhance data recall
Collaborate with teams to align data acquisition with model training needs and build ingestion pipelines
Required Qualifications
Strong background in distributed systems and experience with large-scale data pipelines
Proficiency in Python and experience with web crawling or large-scale data extraction
Familiarity with cloud platforms (AWS) and container orchestration (Kubernetes, Docker)
Understanding of data privacy and responsible crawling practices
Experience in building pre-training datasets for large language models is a plus