Data Acquisition Engineer
Occupations:
Data Warehousing Specialists
Software Developers
Computer Systems Engineers/Architects
Web Developers
Data Scientists
Industries:
Web Search Portals, Libraries, Archives, and Other Information Services
Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services
Educational Support Services
Media Streaming Distribution Services, Social Networks, and Other Media Networks and Content Providers
Management, Scientific, and Technical Consulting Services
The Data Acquisition Engineer role is a full-time remote position focused on developing systems for large-scale web crawling and data acquisition to support the training of frontier models for software development.
Key Responsibilities
Design and operate a large-scale web crawler for acquiring publicly accessible data
Develop specialized crawlers targeting high-value sources to enhance data recall
Collaborate with teams to align data acquisition with model training needs and build ingestion pipelines
Required Qualifications
Strong background in distributed systems and experience with large-scale data pipelines
Proficiency in Python and experience with web crawling or large-scale data extraction
Familiarity with cloud platforms (AWS) and container orchestration (Kubernetes, Docker)
Understanding of data privacy and responsible crawling practices
Experience building pre-training datasets for large language models is a plus