Purdue University - Department of Computer Science - Economics Prof Seeking Assistance with Web Scraping & Data Extraction

Economics Prof Seeking Assistance with Web Scraping & Data Extraction

Joe Mazur, Assistant Professor of Economics in the Krannert School of Management, requires assistance with web scraping and with data extraction from scanned images (in PDF format). This is a short-term project which, if successful, will require more assistance at a later date. The data source is PACER.gov, which hosts electronic court records. We already have a Python-based web scraper and have extracted several hundred gigabytes of raw data in PDF and HTML form. If all goes well, we will be running that program on a massive scale sometime soon, so we will need someone who knows how to handle that process. We have already extracted some usable qualitative and quantitative data from the most recent PDF and HTML files, but more is necessary for the analysis.

Three major tasks require attention. First, many of the PDFs are simply scanned-in images, and the methods we have tried (Adobe’s OCR and Python’s pytesseract) are not returning usable results. Second, the format of the PDF forms changes from year to year, so our current program, which works fairly well on newer files, needs to be modified before it can extract data from older files. Third, there is a particular category of data that requires more searching to find for any given court case, both among candidate PDF files and within a selected PDF file, and we do not yet have any programs written to find or extract it.

The ideal candidate will have successful experience (either in class or on the job) with web scraping and data extraction tasks like those described above. An excellent grasp of Python is required, unless the student can make a case for doing a better job without it. We would prefer a candidate who will be on campus this summer as well as next fall (and maybe into spring), but we will consider someone who is only available this summer. If necessary, we may hire two students.

To apply, please email Joe Mazur at mazur3@purdue.edu. Be sure to

1) attach your resume/CV;

2) indicate in the body of your email what classes/skills/experience you have that would enable you to be successful in this position; and

3) let us know which semesters you expect to be available.

It would also help a great deal if you could list a professor with related expertise as a reference, but it is not required.

Posted by: Jane Do  Wed, 19 Apr 2017, 12:00 AM

Last Updated: May 15, 2019 10:06 AM

Department of Computer Science, 305 N. University Street, West Lafayette, IN 47907

Phone: (765) 494-6010 • Fax: (765) 494-0739
