Posted: May 30, 2025
One of the biggest problems I've had with big data I've scraped is the lack of technical people who can write the queries needed to get information out of the database. That job usually falls on me and a few other do-gooders who work with researchers, take their questions, and write the queries for them. But with the advent of generative A.I., the queries can now quite literally write themselves.
The project started with me looking for data that was available online but not in a searchable form, and that might be useful if it were. I landed on the Honolulu Police Department's (HPD) arrest records. They are released as PDFs every day but only stay on the department's website for a few weeks before disappearing forever. I spent a few weeks building a scraper and a parser to pull the information out of the PDFs and into a MySQL database. The scraping was pretty straightforward, but the parsing turned out to be far more difficult than I first anticipated. The code essentially converts each PDF into multiple images and then breaks those images apart further so that each image represents a single line of text in the PDF. I'm then able to OCR that line into text, and depending on where the text sits in the image, I know which field it belongs to. All of that is stored in the database.
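To give a feel for those last two steps, here is a minimal sketch, not the actual parser. It assumes the page has already been turned into a grid of 0/1 pixels, finds the horizontal bands that contain ink (one band per line of text), and maps the horizontal position of a piece of text to a field. The helper names `split_rows` and `field_for_x`, the column offsets, and the field names are all made up for illustration, and the OCR step itself is left out.

```python
from bisect import bisect_right


def split_rows(pixels):
    """Return (top, bottom) row ranges for each band of rows that contains ink.

    `pixels` is a list of rows; each row is a list of 0 (blank) or 1 (ink).
    `bottom` is exclusive, so pixels[top:bottom] is one line of text.
    """
    bands = []
    start = None
    for y, row in enumerate(pixels):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                    # a new line of text begins
        elif not has_ink and start is not None:
            bands.append((start, y))     # blank row closes the current line
            start = None
    if start is not None:                # text ran to the bottom of the page
        bands.append((start, len(pixels)))
    return bands


# Hypothetical layout: the left edge (in pixels) where each column begins.
COLUMN_STARTS = [0, 200, 450, 700]
COLUMN_NAMES = ["arrest_date", "name", "charge", "location"]


def field_for_x(x):
    """Map the left edge of a text fragment to the column it falls in."""
    idx = bisect_right(COLUMN_STARTS, x) - 1
    return COLUMN_NAMES[max(idx, 0)]


# A toy 6-row "page": two lines of text separated by blank rows.
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
bands = split_rows(page)  # [(1, 3), (4, 5)]
```

In a real pipeline each band would be cropped out of the page image and handed to an OCR engine, and `field_for_x` would be called with the position of each recognized word.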
But now the hard part: how do non-technical researchers get that data out? A fellow student and I set out to build a web interface using the Streamlit framework that would take a plain-language request from the user, convert that request into a SQL query using an OpenAI model, run that query against the database, and then display the results. The prompt to the model essentially contains the schema for the database so it knows what fields exist. We also have to include some specific information about the Hawaii criminal system and some notes about which column to pick if the request is too vague.
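The flow described above can be sketched roughly like this. It is an illustration under several assumptions, not our production code: the table, the columns, the wording of the prompt, and the note about which date column to use are invented; an in-memory SQLite database stands in for MySQL; and `fake_model` is a placeholder for the call to the OpenAI model so the example runs without an API key. The `is_read_only` check is also an addition for the sketch and is deliberately naive.

```python
import sqlite3

# Hypothetical schema; the real database has its own tables and columns.
SCHEMA = """CREATE TABLE arrests (
    id INTEGER PRIMARY KEY,
    arrest_date TEXT,
    charge TEXT,
    location TEXT
)"""

# Hypothetical guidance for vague requests.
DOMAIN_NOTES = "If the request does not say which date to use, use arrest_date."


def build_prompt(question):
    """Combine the schema, the domain notes, and the user's question."""
    return (
        "Write one SQL SELECT statement that answers the question.\n"
        f"Schema:\n{SCHEMA}\n"
        f"Notes:\n{DOMAIN_NOTES}\n"
        f"Question: {question}\n"
    )


def is_read_only(sql):
    """Naive guard: accept a single statement that starts with SELECT."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def answer(question, generate_sql, conn):
    """Ask the model for SQL, check it, run it, and return the rows."""
    sql = generate_sql(build_prompt(question))
    if not is_read_only(sql):
        raise ValueError("refusing to run non-SELECT SQL")
    return conn.execute(sql).fetchall()


def fake_model(prompt):
    """Placeholder for the language model; returns a canned query."""
    return "SELECT charge, COUNT(*) FROM arrests GROUP BY charge ORDER BY charge;"


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO arrests (arrest_date, charge, location) VALUES (?, ?, ?)",
    [
        ("2025-01-01", "DUI", "Waikiki"),
        ("2025-01-02", "Theft", "Kalihi"),
        ("2025-01-03", "DUI", "Kailua"),
    ],
)
rows = answer("How many arrests per charge?", fake_model, conn)
```

Passing the model in as a function keeps the prompt building and the safety check easy to test on their own. A stronger setup would also connect with a database account that only has read permissions rather than relying on string checks.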
The result is a seamless interface where non-technical researchers can analyze the last 2.5 years of HPD arrest data. The website is https://www.hpddata.com.