This week went exceptionally well for me. Not only did I finally move into my new apartment, I also got to hang out with my friends and sing some karaoke. But the best part was a call from my best friend, who currently lives on the other side of the world (Jerusalem, Israel) and works for a non-profit organization. As soon as I picked up the phone she announced that she had just been admitted to the U Penn Law program, starting in September 2015. Now she is waiting for NYU to respond, and if NYU accepts her she will choose NYC over Philly (where the U Penn campus is). I was super excited either way. For one thing, I was psyched to learn that she is going to an Ivy League school; for another, this meant that I would get to see her every weekend at a minimum (it is only an hour from Philly to NYC)!
This is how I came up with the idea for today's Start Data Science Simple blog post. I know that many people dream of studying at Yale and Harvard Law Schools, and the biggest question for them is: "What should I do to get accepted?" I will take a quantitative approach to answering this question.
I was just wondering how many applicants are now waiting for application decisions from their dream Law programs. I'm guessing thousands... But I found 7,512 applicants who were brave enough to register with Top Law Schools and share their application details with the world. Here is what they reported:
- Applicant's user name
- Highest LSAT score
- College GPA
- Schools they applied to
- Schools they were accepted to
- Schools they were rejected by
Unfortunately, this website doesn't provide download capabilities, so in order to use this data I had to scrape the website. In my Oscar blog post I also had to use a scraper to obtain movie box office data, and later on I got some requests to share the data with other data science enthusiasts. So this time around I will share with you all how I typically use scraper tools so you can do it yourself :)
There is an awesome tool designed for non-programmers like myself who need to scrape structured data from websites that don't provide download capabilities. Check out KimonoLabs, where they explain how their tool works and how it literally takes minutes to get the data you are looking at into CSV or JSON format (side note: I have nothing to do with this company, I just think their product deserves the recognition).
After I specified which data objects I wanted to extract, the hard part began. There are 100 users listed per page and 76 pages in total, so ~7,600 rows needed to be extracted. KimonoLabs recommends using its pagination feature to extract data continuously from all pages, but this only works if there is a "Next" button that takes the user to the next page. In my case this button was absent, so I had to extract the data from a single page first and then enable targeted crawling with a generated URL list, where I specified that I needed continuous data extraction from page 1 to page 76.
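For readers who would rather build such a URL list by hand, here is a minimal sketch of the idea. The base URL and the `page` query parameter are placeholders I made up for illustration; the real site may number its pages differently.

```python
# Hypothetical base URL and query parameter; the real site's paging scheme may differ.
BASE_URL = "https://example.com/applicants"


def build_page_urls(base_url, first_page=1, last_page=76):
    """Return one URL per results page, from first_page to last_page inclusive."""
    return [f"{base_url}?page={n}" for n in range(first_page, last_page + 1)]


page_urls = build_page_urls(BASE_URL)
print(len(page_urls))   # number of pages to crawl
print(page_urls[0])     # first page
print(page_urls[-1])    # last page
```

A list like this can then be pasted into whatever crawler you use in place of a "Next" button.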
I also noticed that if I clicked on an individual username (I had to register on the website first), I could see the user's profile, which includes information like the user's gender, the school s/he is attending, application year and college major. I decided to scrape this information as well. Finally, I synced the Kimono API with Google Spreadsheets and extracted my data into a Google spreadsheet.
Once I had my data, I realized that before I could do any analysis I had to transform it in a way that allows slicing and dicing the data, creating visualizations and performing statistical analysis.
Chart 1 shows how I got my raw data from Kimono into Google Spreadsheets. The main problem with the original data file was that the information in the Applied, Accepted and Rejected columns was very unstructured. These fields list all the schools an applicant applied to, was accepted by or was rejected by, but since my data is represented at the applicant level, I needed a better way to link students to each school.
In my opinion, decisions related to data transformations have the biggest impact on the data analysis because data transformation:
- Is the most time-consuming step in the data analysis process (well, maybe except for the time spent looking for the data itself).
- Directly impacts how difficult it is to create visualizations from the data.
- Facilitates statistical analysis to prove/disprove hypotheses.
So here goes...
My Lesson #2
Before getting into the weeds of the data, think about what question you are trying to answer; then prioritize the information you think is essential to answering it; finally, decide on data transformations that will leave you with the fewest fields (columns).
Again, I took my own lesson and went back to my main question.
What should I do to get accepted by my dream Law program?
Do I need to present applicants uniquely in order to answer this question? At the end of the day, Tatiana with an LSAT of 170 and a GPA of 3.7 may be accepted by Columbia and rejected by Yale. From the application standpoint there are two Tatianas who applied to two programs: one case was accepted and the other was rejected. So it makes more sense for me to have a dataset where applications, not people, are uniquely presented. In other words, I will have a dataset where each row is a unique combination of username, LSAT score, GPA, school applied to and status of the application (Accepted/Rejected).
Since my best friend got into U Penn, I decided to focus only on the top 10 schools in the US Law School ranking (which includes U Penn), because it is a pain in the neck to manually transform free text into columns and then into rows (I did it in Excel). Here are the 3 steps I followed:
- I created 10 additional fields, one for each school (from Yale to U of Michigan).
- For each of the Applied/Accepted/Rejected fields I used a filter to find the schools on my list. If an applicant was accepted by a school, I assigned "Accepted" to the cell in the corresponding school's column. The same went for "Rejected". If an applicant listed a school under "Applied" but did not include it under "Accepted" or "Rejected", I assigned "Rejected" to this school by default.
- I turned School columns into rows and created a logical formula that labeled rows as "Accepted" or "Rejected" based on the value in the cell.
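I did all of this by hand in Excel, but the same three steps can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the field names, the comma-separated school lists and the sample applicant are made up, and `TOP_SCHOOLS` holds just a subset of the top 10.

```python
# Illustrative subset of the top 10 list; extend it with the remaining schools.
TOP_SCHOOLS = ["Yale", "Harvard", "Columbia", "NYU", "U Penn"]


def split_schools(text):
    """Split a comma-separated school list into a set of trimmed names."""
    return {name.strip() for name in text.split(",") if name.strip()}


def to_application_rows(applicants, schools=TOP_SCHOOLS):
    """Turn one row per applicant into one row per (applicant, school) case."""
    rows = []
    for person in applicants:
        applied = split_schools(person["applied"])
        accepted = split_schools(person["accepted"])
        rejected = split_schools(person["rejected"])
        for school in schools:
            if school in accepted:
                status = "Accepted"
            elif school in rejected or school in applied:
                # Applied with no reported outcome counts as "Rejected" by default.
                status = "Rejected"
            else:
                continue  # the applicant never mentioned this school
            rows.append({
                "username": person["username"],
                "lsat": person["lsat"],
                "gpa": person["gpa"],
                "school": school,
                "status": status,
            })
    return rows


# Made-up sample record, assuming comma-separated free text in each field.
applicants = [
    {"username": "tatiana", "lsat": 170, "gpa": 3.7,
     "applied": "Yale, Columbia, NYU", "accepted": "Columbia", "rejected": "Yale"},
]
rows = to_application_rows(applicants)
for row in rows:
    print(row["username"], row["school"], row["status"])
```

Real free-text fields are messier than a clean comma-separated list (abbreviations, typos, different spellings of the same school), so some manual cleanup would still be needed before a script like this is reliable.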
Chart 2 shows what the dataset looked like after I completed my transformations.
Now a row no longer represents a unique applicant; instead it represents a unique "case". In other words, a combination of username, LSAT, GPA, School and Status forms a unique row.
Chart 3 shows the dataset I worked with after I appended students' application year (when a student applied) and their college major (I got this data from scraping each individual's profile page).
First, one of the things I was interested in was the "score profile" of students who applied to top 10 Law Schools. What were their LSAT and GPA scores? Were these applicants accepted or rejected? Do all top schools accept students with the same GPA and LSAT scores or is there a difference?
Chart 4 shows average GPA and LSAT scores broken down by school.
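If you want to reproduce this kind of per-school summary outside of a charting tool, a rough sketch could look like the one below. The rows are made-up sample data, the field names are my own, and restricting the averages to accepted applications is an assumption on my part.

```python
from collections import defaultdict
from statistics import mean


def average_scores_by_school(rows, status="Accepted"):
    """Average GPA and LSAT per school for rows with the given status."""
    grouped = defaultdict(list)
    for row in rows:
        if row["status"] == status:
            grouped[row["school"]].append(row)
    return {
        school: {
            "avg_gpa": round(mean(r["gpa"] for r in group), 2),
            "avg_lsat": round(mean(r["lsat"] for r in group), 1),
            "count": len(group),
        }
        for school, group in grouped.items()
    }


# Made-up sample rows, one per (applicant, school) case.
sample_rows = [
    {"school": "Yale", "gpa": 3.9, "lsat": 175, "status": "Accepted"},
    {"school": "Yale", "gpa": 3.8, "lsat": 173, "status": "Accepted"},
    {"school": "Yale", "gpa": 3.5, "lsat": 165, "status": "Rejected"},
    {"school": "U Penn", "gpa": 3.7, "lsat": 170, "status": "Accepted"},
]
summary = average_scores_by_school(sample_rows)
for school, stats in summary.items():
    print(school, stats)
```

Keeping the count next to each average makes it easier to spot schools where the average rests on only a handful of applications.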
It looks like top schools are accepting top students (what a surprise!), with average GPAs ranging between 3.7 and 3.9 and highest LSAT scores between 170.6 and 174.0 (out of 180). So second, I wanted to know if there are any outliers in these schools' choices. In other words, are all students who get accepted by Yale and Stanford in the top 1% based on GPA and LSAT? Or maybe there are applicants who did not have as high a GPA and/or LSAT but still made it into the program? Does college major play any role in these schools' decision-making process?
Chart 5 shows the visualization I put together to answer these questions. This visualization is interactive. At the top you can see a scatterplot that shows each applicant's scores (GPA & LSAT) along with his/her acceptance status (Accepted/Rejected) for all top 10 schools. On the right you can select a school you are interested in (my BFF will most likely click on U Penn and NYU) and see only the students who applied to this school and their acceptance status. You may also select an application year and see how applications changed over time. At the bottom you can see a breakdown of application status by college major. This histogram also updates as you filter by individual Law school and/or application year.
As mentioned earlier, top schools demand top talent. If you want to be accepted by Yale, Harvard, Stanford or Columbia Law School, on average you will need a GPA of 3.8 and an LSAT of 173-174. Oh well... what if you don't have them? Is there any chance of still getting accepted?
It looks like you still have a chance ;) There were ~40 students (out of 600) who had LSAT < 173 and GPA < 3.8 and still made it to the top 10 Law Schools.
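A count like this boils down to a simple filter on the case-level rows. Here is a small sketch with made-up sample data and my own field names; it counts distinct usernames, which is my reading of "students" here.

```python
def accepted_below_thresholds(rows, lsat_cutoff=173, gpa_cutoff=3.8):
    """Usernames with at least one acceptance despite LSAT and GPA below both cutoffs."""
    return sorted({
        row["username"]
        for row in rows
        if row["status"] == "Accepted"
        and row["lsat"] < lsat_cutoff
        and row["gpa"] < gpa_cutoff
    })


# Made-up sample rows, one per (applicant, school) case.
sample_rows = [
    {"username": "anna", "lsat": 170, "gpa": 3.6, "school": "NYU", "status": "Accepted"},
    {"username": "anna", "lsat": 170, "gpa": 3.6, "school": "Yale", "status": "Rejected"},
    {"username": "ben", "lsat": 175, "gpa": 3.9, "school": "Yale", "status": "Accepted"},
    {"username": "carl", "lsat": 168, "gpa": 3.5, "school": "Harvard", "status": "Rejected"},
]
lucky_ones = accepted_below_thresholds(sample_rows)
print(lucky_ones)
```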
Another big factor in the Law school admission decision, it seems, is the applicant's college major. In general, students with Engineering, Economics and Philosophy majors are favored by Law schools over students with Business or Journalism degrees. Students with a Law major (my BFF has a Law major) have a 57% chance of being accepted by a top 10 Law school, 100% by U Penn (limitation: only 2 students with a Law major applied to U Penn) and 60% by NYU (which she has not heard from yet).
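Acceptance rates by major are just accepted cases divided by all cases for that major. The sketch below uses made-up rows and my own field names, and it returns the number of cases alongside each rate so that tiny samples (like the 2 Law majors at U Penn) are easy to spot.

```python
from collections import Counter


def acceptance_rate_by_major(rows):
    """Map each major to (share of accepted cases, total number of cases)."""
    totals = Counter()
    accepted = Counter()
    for row in rows:
        totals[row["major"]] += 1
        if row["status"] == "Accepted":
            accepted[row["major"]] += 1
    return {major: (accepted[major] / totals[major], totals[major]) for major in totals}


# Made-up sample rows, one per (applicant, school) case.
sample_rows = [
    {"major": "Law", "school": "U Penn", "status": "Accepted"},
    {"major": "Law", "school": "U Penn", "status": "Accepted"},
    {"major": "Business", "school": "U Penn", "status": "Accepted"},
    {"major": "Business", "school": "U Penn", "status": "Rejected"},
    {"major": "Business", "school": "U Penn", "status": "Rejected"},
    {"major": "Business", "school": "U Penn", "status": "Rejected"},
]
rates = acceptance_rate_by_major(sample_rows)
for major, (rate, cases) in rates.items():
    print(f"{major}: {rate:.0%} accepted out of {cases} cases")
```

A 100% rate based on 2 cases says very little, so I would read any rate together with its case count.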
There are a lot of conclusions that can be derived from this exploratory analysis. What did you learn?
Or maybe there are other ways this data could be presented? How would you look at it?