Making BOJ Crawler


2025. 05. 04.

I had the following thoughts while creating a project to crawl solved problems from Baekjoon based on user ID.

1

The first reason I started this project was to organize the problems I've solved so far on my new Baekjoon account. Since I haven't solved even 200 problems yet, manually compiling the data by going through my submission list wouldn't have taken that much time. In fact, it might even have been faster than researching how to write crawling code and building a crawler from scratch. There's an old saying that a developer is someone who spends 2 hours writing code to finish a 30-minute manual task in 3 seconds.

In this situation, the real question was which was the more valuable use of my time: becoming a "human crawler" and collecting the data by hand, or writing the crawler code. To defend writing the crawling code, I could argue that even though it took longer, I can reuse the crawler for similar tasks in the future. To defend the manual approach, I could argue that once I've compiled all the data, I would only need to update it incrementally as I solve more problems. Using a crawler would be like using a sledgehammer to crack a nut, and manual work might be the more reasonable choice in this case.

However, things seem to have changed with the arrival of AI tools. I simply provided a link where my solved problem list could be checked and said "crawl this," and Cursor immediately wrote code that organized the submission number, problem number, and programming language into a list. When I mentioned that each page only shows 20 records and asked it to navigate through all pages to compile the complete record, it immediately wrote that code too. Initializing the project, examining the HTML, and figuring out how to extract the necessary information would normally have taken me at least tens of minutes; now I get working code in about a minute. There's absolutely no reason to become a human crawler to organize Baekjoon problem-solving records anymore. Doing it by hand would essentially be throwing away tens of minutes of my life.
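To make that concrete, here is roughly the kind of code such a prompt produces, written as a minimal Python sketch. The status URL, table selectors, column positions, and the next-page link are illustrative assumptions, not Baekjoon's actual markup.

```python
# Minimal sketch of a paginated submission crawler.
# The URL pattern, table structure, and column order below are assumptions
# for illustration; the real Baekjoon markup may differ.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.acmicpc.net/status"
HEADERS = {"User-Agent": "Mozilla/5.0"}  # bare requests are often rejected


def crawl_submissions(user_id: str) -> list[dict]:
    """Collect (submission number, problem number, language) across all pages."""
    results = []
    url = f"{BASE_URL}?user_id={user_id}"
    while url:
        soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
        # Assumption: each submission is one table row with a fixed column order.
        for row in soup.select("table tbody tr"):
            cells = [c.get_text(strip=True) for c in row.find_all("td")]
            if len(cells) < 7:
                continue
            results.append({
                "submission_id": cells[0],
                "problem_id": cells[2],
                "language": cells[6],
            })
        # Assumption: a "next page" anchor is present while more records remain.
        next_link = soup.select_one("a#next_page")
        url = f"https://www.acmicpc.net{next_link['href']}" if next_link else None
    return results
```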

2

The second reason for creating the crawler project was because I needed to collect solving records from multiple people. Recently, I created an open chat room called "Today's Problem Solving Complete" and recruited people who would solve at least 10 Baekjoon problems per month. How would I verify that the people in this room are actually solving problems? By checking their solving history using their Baekjoon IDs. Of course, if there aren't many participants, I could just manually check their Baekjoon profiles at the end of each month. But since the room can grow indefinitely, having a crawler would definitely be useful.

In the past, I might have used my existing code to crawl the entire solving history every month and then separately written code to count the number of problems solved in the current month. I would have assumed that the Baekjoon server already handles plenty of traffic, so my additional requests wouldn't be a problem. Since running the crawler code for a few extra minutes wouldn't affect my ability to implement the desired functionality, I probably would have decided not to bother modifying the existing code to extract solutions from only a specific month.

But now, I can simply ask Cursor to add a 'yyyymm' argument that stores only the solving records for that period and, since submissions are sorted by date, stops crawling once it reaches submissions older than the specified month. Once again, it immediately wrote working code. If it's this easy, shouldn't I spend the energy I would otherwise use to justify my laziness on simply opening Cursor and adding the functionality? It's truly a frightening world.
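A minimal sketch of that early-stop filter, assuming submissions arrive newest-first and carry a 'submitted_at' field in 'YYYY-MM-DD HH:MM:SS' format (both are illustrative assumptions):

```python
# Keep only submissions from the requested month and stop at the first older
# one. If `submissions` is a page-by-page generator, stopping here also stops
# any further page requests.
from datetime import datetime
from typing import Iterable, Iterator


def take_month(submissions: Iterable[dict], yyyymm: str) -> Iterator[dict]:
    for sub in submissions:
        month_key = datetime.strptime(
            sub["submitted_at"], "%Y-%m-%d %H:%M:%S"  # assumed date format
        ).strftime("%Y%m")
        if month_key < yyyymm:
            break  # everything after this point is older than the target month
        if month_key == yyyymm:
            yield sub


# Example with made-up data: only the April 2025 submission survives.
sample = [
    {"submission_id": "3", "submitted_at": "2025-05-02 10:00:00"},
    {"submission_id": "2", "submitted_at": "2025-04-28 21:30:00"},
    {"submission_id": "1", "submitted_at": "2025-03-15 09:12:00"},
]
print(list(take_month(sample, "202504")))  # keeps only submission_id "2"
```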

3

If I were to further improve the crawler, I could make it avoid re-crawling information that's already been collected based on submission IDs. I could also implement version management for the crawled information format, so that if the format of newly crawled solutions doesn't match the existing records, the entire crawling process could be restarted.
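Both improvements are simple enough to sketch. Assuming a local JSON store and the field names used above (all illustrative), the incremental merge and the format-version check might look like this:

```python
# Sketch of incremental crawling plus a format-version check.
# File name, schema, and version number are assumptions for illustration.
import json
from pathlib import Path

FORMAT_VERSION = 2                # bump whenever the record schema changes
STORE = Path("submissions.json")  # assumed storage location


def load_store() -> dict:
    """Load saved records; discard them if the stored format is outdated."""
    if not STORE.exists():
        return {"version": FORMAT_VERSION, "records": []}
    data = json.loads(STORE.read_text(encoding="utf-8"))
    if data.get("version") != FORMAT_VERSION:
        # Old format: start over, which forces a full re-crawl.
        return {"version": FORMAT_VERSION, "records": []}
    return data


def merge_new(store: dict, crawled: list[dict]) -> dict:
    """Add only submissions whose IDs are not already stored."""
    known = {r["submission_id"] for r in store["records"]}
    store["records"].extend(r for r in crawled if r["submission_id"] not in known)
    return store


def save_store(store: dict) -> None:
    STORE.write_text(json.dumps(store, ensure_ascii=False, indent=2), encoding="utf-8")
```

Keying the merge on submission IDs makes re-runs idempotent: crawling the same pages twice never duplicates a record.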

Perhaps what developers in this new era need to know is how to plan such functionality improvements. Now that writing (or generating) simple code is something anyone can do, it's hard to differentiate yourself with that skill alone. But planning is a little different. This might be an area that non-developers who build projects purely through "vibe coding" haven't yet reached. You need to be able to imagine how a project could be improved, and also judge whether such improvements would bring practical benefits (or whether they would require too much effort for little gain). Someone who can't write code and only looks at generated code has no way to know why certain code wasn't generated, or whether the code in front of them is the best solution.
