Berkeley Statistics Masters Program

Oct 8, 2020 | 23 minutes read

How can you prepare for the program?

The stats MA program combines probability, statistics, linear algebra, computer science, and some elements of pure mathematics all together in one. Unless you have a double major in Stats and CS with a minor in math and a full upper division course in linear algebra, you’ll have to learn some things for the first time or brush up on other forgotten material.

Coding / Computer Science:

  • If you don’t have coding experience, consider practicing R, SQL, Unix, Python, and the basics of HTML before the semester begins. You’ll also need to know a bit of R Markdown and LaTeX, but those are easier to learn quickly.
  • R - check out swirl! Go through as many modules as possible. I wish I had done this earlier, and become more fluent before class started.
  • SQL, HTML, and Python - check out w3schools, and do the first few lessons.
  • LaTeX - get an account on overleaf.com. It’s the most user friendly and easy to share. You don’t have to worry about a fancy compiler or IDE.
  • Git - you will need to understand Git and GitHub, and learn to use them from the terminal.
  • Unix - this covers the commands you type at the terminal (the bash shell, a.k.a. the command line prompt). It is very useful to understand and will be necessary in the STAT 243 assignments.

Probability and Statistics:

  • If you don’t have a strong probability and statistics background, go through the book Mathematical Statistics and Data Analysis by John Rice (a quick Google search will yield a free pdf version). It’s what the Berkeley upper division classes use to prepare students for 201A and 201B. The classes are called “Stats 134 - Probability” and “Stats 135 - Statistics”. If you can do most of the problems in Rice, you’ll be in a good position to begin 201A and 201B. In my case, I had only taken a single statistics class, which didn’t cover much of what was in Rice, so the learning curve was incredibly steep. I spent most of my time studying undergrad textbooks to catch up to what was going on in class.

Linear Algebra:

Coming into the program, I had several semesters of abstract algebra and linear algebra under my belt from undergrad as a math major, and it was a huge help. As the program has gone on, that background has become more and more valuable. Students who were not strong in linear algebra hated the end of Stats 243, which was very heavy on linear algebra, and were completely lost in Stats 230 - Linear Models. If you have the bandwidth, consider auditing a lower division class, or check out David Lay’s undergraduate level book (used in Math 54 at Cal) or Stephen Friedberg’s upper division level linear algebra text.

Coursework:

Fall Semester:

Stats 243 - Statistical Computing

  • Taught by Professor Chris Paciorek
  • Focuses on using R, Unix, and (more minimally) SQL to carry out statistical computing.

Stats 201A - Probability

  • Taught by Professor Aditya Guntuboyina
  • Here’s a pdf of the lecture notes from 2018: Link to Document
  • Every part of those lecture notes is necessary for the second semester coursework. Any part that is unclear will be something you have to re-learn in the second semester.

Stats 201B - Statistics

  • Taught by Professor Haiyan Huang
  • Follows material similar to Professor Larry Wasserman’s “All of Statistics” textbook. Homework and exam problems are often taken directly from Professor Wasserman’s materials.

Spring Semester:

Stats 222 - Capstone project course

  • Taught by Thomas and Libor
  • Textbook: “Elements of Statistical Learning” by Hastie, Tibshirani, Friedman
  • Historically, this class is only offered in the evenings, so RIP your Tuesday and Thursday evenings.
  • In the lectures, a very broad view of different statistical models is presented.
    • I highly recommend making your own schedule for reading relevant resources to understand the material.
    • I haven’t utilized the office hours offered by the professors, but I have colleagues who have reported positive experiences.

Stats 230 - Linear Models

  • Taught by Professor Peng Ding
    • Utilizes material from 201A and 201B, as well as a great deal of intermediate to advanced linear algebra content to build a foundation for linear models. An alternative name for the class could be “Regression” or “Methods of Regression”.
    • Peng is extremely knowledgeable and dedicated to this course. I’ve sent several emails to ask more granular linear algebra questions or brief clarifications, and he is very quick to respond.

Qualifying elective of your choice

  • Most students took an elective in the stats department, which is a standard lecture based class with homework due every one or two weeks.
  • Alternatively, you can select a project based machine learning course in IEOR, CS, INFO, or other qualifying department.
  • Note: If your units don’t add to 12 (because the project classes often are only 3 units, while Linear models and capstone are 4 each), you’ll need to add a seminar for one unit. This may be one hour per week where a guest lecturer will come and talk about a specific area of their research. Most of the content will go straight over your head, but it’s good experience encountering new and novel research. Also, if the lecture is a “job talk” where the guest speaker is hoping to get a position at Cal in the stats department, you might get to watch their research get put on blast by the more senior stats department faculty.

Evans Hall:

  • Your home for the next 2-3 semesters.
  • The third floor has the statistics department, and the 4th floor has the master’s lounge.
  • The elevators are slow, and the main three don’t go to the ground floor; you’ll have to get off at floor 1 and walk down the stairs.
  • Out of the main three elevators, the shortest distance to the lounge after arriving on the fourth floor is as follows: walking out of the east-most elevator, turn right; walking out of the middle elevator or the west elevator: turn left. A rigorous counting of floor tiles was used to make this conclusion, but a formal proof is left as an exercise to the reader. The most inefficient path to the lounge is the one that includes a lengthy discussion in the hallway about which path is optimal.

Comprehensive Exam:

  • The exam was administered on January 25th 2020, and we received our results by email on February 26th.
  • The material was almost identical (or exactly identical) to problem sets, midterm questions, or other practice problems we had seen in the past. It was around the difficulty of a medium-range homework problem, with nothing much easier or much more difficult.

GSI Appointments:

  • GSI appointments come in different time commitment categories: 25% and 50%, corresponding to roughly 10 hrs per week and 20 hrs per week, respectively.
  • Statistics MA students who are interested in being a GSI are usually offered appointments in the 2nd and 3rd semesters, and if you want to be a GSI in the first semester, you need to get special permission from the stats department.
  • I held two different GSI appointments: Math 1B (second-semester calculus) in Fall 2019, and Stats 135 (upper-division statistics) in Spring 2020.
  • In general, being a GSI means you have 8 hours of in-person work per week: either four hours of instruction and four office hours, or six hours of instruction and two office hours.
  • To apply, reach out to the proper coordinator for the department you want to be a GSI in, and they will send you a schedule of classes for the upcoming semester and ask you to list your preferences. Then, you will receive an offer and go through the proper paperwork and paper-signing as well as an orientation before beginning your work.
  • Some appointments require a significant amount of work, and others are more relaxed. I was fortunate to have two different appointments with two different professors who, despite having polar opposite personalities, gave a similarly relaxed working schedule, which I appreciated very much in the midst of the challenging course load in the stats MA.
  • Time consuming activities: typing up worksheets, quizzes, exams in LaTeX, and grading on gradescope.

Other Tips:

  • Find a study group early on, and make a routine of collaborating together. Most people in the program need to work with others in order to finish the assignments. This is also important for networking later as you’re looking for jobs and learning about what’s happening in industry.
  • Some of my colleagues went to all the career fairs in Fall semester, but only one or two were able to get a job offer or internship that early. Waiting until second semester is totally reasonable.
  • Find a place to live that is close to Evans, even if it means paying more. Every hour of the day is valuable when you’re trying to keep up with the fast paced learning environment of the program, and commute time can make that even more challenging.

Self-Guided Learning:

Many assignments will require you to learn something new and apply it immediately. For instance, an assignment in Stats 243 early on was to do the following:

  • use something called “web-scraping” to obtain data (requires knowledge of HTML & CSS)
  • manipulate it in RStudio using R and Python (maybe using SQL-like commands to subset / retrieve data)
  • generate plots, charts, and statistical analyses (in R with particular libraries like ggplot2, or matplotlib in Python)
  • produce your results in a .Rmd file (using three “languages”: R, R Markdown, and LaTeX where necessary)
  • and then push the results to a remote GitHub repository (using command line Unix).

The assignment was due 10 days after it was announced. The description of the task itself was two pages long. For some students, it meant learning up to four or five languages, two or three new programs/applications, and several packages or libraries across those languages. This is part of the learning process. It begins as something that seems entirely unreasonable. Then, at the end of the ten days, it all seems obvious: you can’t imagine calling library() before you call install.packages(), and you make fun of each other when you don’t have quotes around string objects or your for loop has no conditional statement, causing a runtime error.
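
To make the shape of that kind of assignment concrete, here is a minimal sketch of the first two steps (scrape, then subset) using only the Python standard library. It is not the actual Stats 243 assignment: the HTML snippet, the column names, and the helper names (TableParser, scrape_table) are all made up for illustration, and a real scrape would download the page rather than hard-code it.

```python
from html.parser import HTMLParser
from statistics import mean

# Hypothetical page content; a real scrape would download this
# (for example with urllib.request.urlopen) instead of hard-coding it.
HTML = """
<table>
  <tr><th>city</th><th>temp_f</th></tr>
  <tr><td>Berkeley</td><td>64</td></tr>
  <tr><td>Oakland</td><td>66</td></tr>
  <tr><td>Davis</td><td>81</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def scrape_table(html):
    """Turn the first row into column names and the rest into dicts."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(HTML)

# SQL-like subset, roughly: SELECT * FROM records WHERE temp_f < 70
cool = [r for r in records if int(r["temp_f"]) < 70]

# A first summary statistic before any plotting.
avg_temp = mean(int(r["temp_f"]) for r in records)

print([r["city"] for r in cool], round(avg_temp, 2))
```

The plotting, .Rmd write-up, and git push steps are left out; the point is only that each bullet above is a small, separate skill that chains into the next.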

Much of the material will be learned on your own, on Stack Overflow, and through others in the program with more expertise. Several classmates’ names immediately come to mind for different categories of questions: stats (Fitch), ggplot2 and data visualization (Mirella), LaTeX (Kyle), and linear algebra (myself!). Practice being a resource and asking for help: that’s what industry is like!

Departmental Info & Communication

You will receive many emails from the department:

  • Job opportunities for PhD students (these do not apply to you).
  • Updates about a seminar entitled “Multivariate extensions of isotonic regression and total variation denoising via entire monotonicity and Hardy-Krause variation”. What? You don’t know any of those words? That’s okay. I don’t either, and I’m almost done with the program.
  • Notices about the SCF (Statistical Computing Facility) and how it’s being updated or maintained.
  • The weekly “wind down” from the PhD students & social committee about where the social hangout event of the week will be.

Muting Email Chains

Pro-tip: if you know an email chain will continue, but don’t want to get updates and notifications, Gmail supports “mute”. Many times a person in our department will be awarded a grant or have something named after them, and a dozen or so people will reply all, which is a wonderful way to celebrate an important accomplishment! But it’s also distracting if you’re trying to work or waiting for more important / relevant emails for group projects or interviews.

FAQ:

When a new admit emails me with questions, I’ll answer them and add them here.

Question: Did you enjoy your programme?

  • I enjoyed the program, yes!

Question: What were elements you enjoyed the most?

  • It was rigorous and provided a significant theoretical backdrop to understand statistical methods. I enjoyed the fast paced learning environment and the comfort we gained with various tools even after a short time.

Question: What did you not like?

  • Because the program is very interdisciplinary, there were elements that were completely out of my league, but reasonable for other students, and vice versa. It would have been nice to have more supplementary materials to bridge the gap. I’ve curated most of the resources on my own or after much research and asking around, which was tiresome.

Question: How much work experience do students have (I have worked in quant finance for three years)? What percentage of students have work experience?

  • I would say about a year of experience is average, with about half of the cohort coming straight out of undergrad.

Question: From the module pages and your websites the modules seem to provide a very deep understanding of fundamental statistical models. Is that true? How much does the degree cover more modern approaches of statistical learning?

  • In 201B, Professor Huang mentions many of the more modern statistical methods, and the capstone course covers more of them. Together with Professor Ding’s Linear Models, you’ll have a pretty solid overview of the timeline up to the present day of how methods were developed and utilized. The answer to this question also heavily depends on your definition of “modern”, since the field is changing rapidly.

Question: How much application is included in the degree? From what I understand from the module page and your website, the lectures are theoretical but the coursework will make you apply the learned models on real world data. How did you experience it?

  • The first semester is almost entirely theoretical, with Stats 243 having applied elements. The second semester is where you do more application and work on projects. I highly recommend doing a project on your own, or choosing an ML project that has already been done and looking at examples, before the second semester. Work through EDA, model selection, fitting, and estimation. Even a brief small project to get the hang of it will put so much of second semester in context as you experience it. Becoming more familiar with Python, R, and SQL is an excellent usage of your winter break (see my “programming” section in the “industry” tab).

Question: Could you tell me a bit more about the capstone project? Should I think of it as a master thesis? Could you tell me about 1-2 example projects?

  • To finish the MA, you can choose either a traditional research thesis, or the capstone project. 99% of students (data purely anecdotal) choose the capstone. The capstone together with your comprehensive exam qualify you to receive your degree.
  • There are 10 example projects that Thomas and Libor outline to offer as options, or you can choose your own if your proposal is accepted. One project was about analyzing airbnb rental descriptions using natural language processing to find out if location is truly the only thing that matters, or if there is a “more rentable” set of description features. Another one is Alzheimer’s disease brain imaging analysis to find patterns for early detection. Another is geospatial analysis of taxi cab data from Manhattan to find information about optimal fares or pathing for drivers.

Question: How do you see the job prospects of graduates? What are roles that students tend to go for? In my case, given my previous work experience, I would not want to join a corporate graduate programme but join as an experienced hire.

  • From what I have seen, some students go directly into a full data-scientist position or into a statistical consulting capacity, while others start as entry level data analysts, or apply to internships rather than full time positions. Your post graduation path is dictated mostly by your own job search abilities, connections, and prior work experience. Many of my classmates have several years of experience, and so also won’t be applying to entry level roles.

Question: When you say that the program is very interdisciplinary, do you mean in terms of background of students taking it? I am asking because from my understanding, the degree sounds fairly focused on statistics, compared to other Stats/DS programs that have more elements of computer science in it for example.

  • I suppose I mean both that it has students from lots of different backgrounds, and incorporates many different disciplines. It is less CS heavy than other programs, and leaves most of the learning about programming and implementation for you to do on your own.

Question: As far as my third question on “modern” statistical methods is concerned, I tried to ask if the degree is focusing more on fundamental statistics only or will also teach you commonly applied statistical methods in the industry such as dimensionality reduction techniques, ML, NLP, computer vision, etc. I have seen some other graduate programmes (e.g. NYU’s MS in DS programme) that seem a bit more focused on ML applications. Berkeley’s programme is of course a statistics programme and to that extent focuses more on fundamental statistical methods. However, I am trying to understand how much of the application of these in ML is covered. I see that the module Statistical Learning Theory seems to cover these methods but am wondering how much else they are covered (you mentioned 201B and 230A also cover some of these methods). Would you think it is a fair assumption that this programme provides a rigorous understanding of statistics and includes some elements of ML but leaves the application of these for the capstone project and coursework?

  • Your last sentence is accurate, I would say.

Question: Speaking about electives, which elective did you take and which one would you recommend? For example, would you recommend taking the ML focused module from the stats department (241A) or would you take the module offered by the computer science department?

  • Personally, I would rank them in this order: the CS courses will give you the biggest bang for your buck, followed by a rigorous class from the stats department. Other departments with more project based classes can be interesting, but I wouldn’t recommend one unless it was similar enough to the field you intend to enter. I think for you it would make sense to find an economics or finance elective with sufficient programming or stats to qualify.

Question: Lastly, I am also holding offers from NYU, Imperial College London and some other European programs. From your insights into the field, how would you say Berkeley ranks up against NYU and Imperial? From my understanding Berkeley is really top notch in the field of statistics. However, I wanted to ask what your view is and how you see NYU and Imperial in the field of statistics/ml/ds.

  • I can’t speak to comparing your different offers, simply because there are too many unknowns. However, I would say that positioning yourself optimally for whatever you plan to do afterwards should be your number one priority. I’m hoping to work and live in the bay after I graduate, so Berkeley was a natural fit, as well as being the rank 2 university for Statistics. If finance is your main goal, it may be less convenient to build your network in Berkeley for 2-3 semesters, but if you’re single (unmarried) and mobile, it will be less of a strain, and one year goes very quickly, if you choose to do 2 semesters. You have a lot of resources, and the networking is good at a high rank institution, but sometimes the environment or instruction can suffer due to MA students being less of a priority than research, PhD students, etc. But I think this exists in academia more broadly to different degrees.

Question: On the point on modern / ML techniques, do you feel that these are sufficiently covered as part of the programme (lectures+coursework+capstone project) so that you would be able to apply them? My objective of this degree is to a) deepen my understanding of statistics and b) learn new tools that I have not learned in my econometrics classes / professional experience, such as Neural Networks, NLP, support vector machines, etc. From how you describe the coursework (web scraping + analysis) and the capstone project (airbnb NLP project, Alzheimer computer vision project), it sounds like you do learn these tools as part of the degree. Do you see it the same way? Do you feel there are some things you did not learn (compared to your expectations or compared to a DS degree?)

  • Yes, I think they are sufficiently covered. Additionally, the second semester is very much tailorable to fit what you want to get out of it, so I’m sure you can pick a method or framework you want to improve in or specialize with and incorporate it into your coursework. There are some things I didn’t learn, but I could have learned any of the things you mentioned; I just chose other things, and I didn’t fill up my schedule quite to the brim. I worked throughout the program, GSI’d, and also train competitively in Olympic Weightlifting, along with being married, so the program isn’t the sole focus for me like it is for most students.

Question: Lastly, you mentioned studying for 3 semesters. Is this common? It does of course sound nice to spend 3 semesters instead of 2 semesters studying, however this would mean a 50% increase in (already high) cost of the degree…

  • There are a handful (maybe a dozen or more) students who do the third semester. I’m contemplating it myself in the current status of things, but leaning toward finishing in May. I think for you a third semester wouldn’t make much sense, since you have experience and know what you want to do. However, for students who want a more comprehensive study of certain ML methods, or want to spread out their coursework in order to be more thorough, the third semester can be a strategic means of maximizing the value of their time at Berkeley.

Languages and Resources

SQL

Learn

Beginner:

w3schools SQL Tutorial - I worked through the first section up to “union” in about 1.5 hours.

  • In every example, go to “try it for yourself” and delete the code in the box and retype it from memory based only on the description. Following and agreeing with code is different than producing it.
  • If you don’t know an answer to an exercise, don’t click “show answer”. Google and phrase your question so that you can find the answer quickly; this is what you do in an actual job.
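
If you want to retype queries from memory without the w3schools editor, one option is Python’s built-in sqlite3 module. The sketch below is only an illustration: the students and gsis tables and their rows are invented, and it touches just the topics named above (SELECT, WHERE, ORDER BY, and UNION).

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, made up for practice.
conn.executescript("""
CREATE TABLE students (name TEXT, program TEXT, units INTEGER);
CREATE TABLE gsis (name TEXT, course TEXT);
INSERT INTO students VALUES ('Ana', 'MA', 12), ('Ben', 'PhD', 8), ('Cy', 'MA', 11);
INSERT INTO gsis VALUES ('Ben', 'Stats 135'), ('Dee', 'Math 1B');
""")

# SELECT ... WHERE ... ORDER BY, with a bound parameter instead of
# pasting the value into the query string.
ma_students = conn.execute(
    "SELECT name, units FROM students WHERE program = ? ORDER BY units DESC",
    ("MA",),
).fetchall()

# UNION stacks two result sets and drops duplicate rows ('Ben' appears once).
everyone = conn.execute(
    "SELECT name FROM students UNION SELECT name FROM gsis ORDER BY name"
).fetchall()

conn.close()

print(ma_students)
print(everyone)
```

Try deleting one query at a time and rewriting it from the comment alone, the same way as in the “try it for yourself” boxes.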

SQL Teaching - The easiest tutorial to learn SQL - an excellent resource that’s a bit more free form (you have to code everything yourself, unlike w3schools). I recommend doing it after w3schools if you have no experience, or try it first if you have already been exposed to SQL. (Thanks Kyle G for this resource!)

Intermediate:

Tutorial_Databases - a tutorial on databases from the Berkeley SCF (statistical computing facility).

Style Guide

Style guide by Simon Holywell

Practice

Leetcode has a specific section for database questions: Leetcode Database Questions

  • On Leetcode, I generally work from easy to hard in terms of difficulty, and then randomize and cover the difficulty column, because in an interview they won’t tell you if it’s a trick question or the difficulty level.

Projects

I hope to add some projects here with accompanying data & project steps.

Python

Set-Up:

  • You can use Jupyter Notebooks, Google Colab, or Pycharm. I’m still learning Pycharm, but have had success with the first two.

Learn

  • Python for Data Analysis, 2nd edition: simply Google search for a pdf. Chapters 4-6 are particularly useful if you’re unfamiliar with dataframes and the strange ways pandas will behave; it’s helpful to build intuition this way.
  • Numpy Section - w3schools - numpy is a useful library in Python, and it’s important to be familiar with the different commands!
  • Machine Learning Section - w3schools - I really wish I had read the last few modules of this section before I had started all the statistics & probability… I had no idea what any of the theory was supposed to be applied to.
  • Plotly:
  • Matplotlib & Seaborn tutorial - Introduces several very useful ways to employ the sns and plt libraries.
  • Reddit Permalink Comment - looks like some great resources. I’ll explore and expand each one if I have a good experience.
  • Random Forests:
    • Changing Categorical Data in Python - one part of implementing algorithms is dealing with categorical data and changing it from categories like “AAA” and “BBB” to numbers. This article discusses approaches to that end.
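
As a rough illustration of what “changing categories to numbers” means, here is a sketch of two common approaches, label encoding and one-hot encoding, written in plain Python. The function names and the example ratings are made up for this post; in practice you would more likely reach for a library routine such as pandas.get_dummies, and the linked article may recommend something different.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {cat: i for i, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Give each category its own 0/1 column."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


# Hypothetical categorical column.
ratings = ["AAA", "BBB", "AAA", "CCC"]

codes, mapping = label_encode(ratings)
rows, columns = one_hot_encode(ratings)

print(codes)    # one integer per observation
print(columns)  # column order for the one-hot rows
print(rows)
```

Label encoding imposes an ordering on the categories (here alphabetical), which a model may treat as meaningful; one-hot encoding avoids that at the cost of one column per category.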

Practice

Open University Learning Analytics Dataset (OULAD) - the OULAD page has a database schema, a data dictionary, and several examples of analysis from others. Can you recreate their work or understand what they have done? Can you explain all of the variables in the data? This data is in the sweet spot: not too complicated, but multi-faceted enough to be good practice.

Projects

I hope to add some projects here with accompanying data & project steps.

R

Learn

Practical:

  • Swirl is my favorite learning device for R.

Textbook / Theory:

Style Guides:

Practice

  • Find any graph on any article where a dataset is referenced, and try to recreate the graph using the dataset.

Projects

  • If some reasonably stated projects come up, I’ll put them here. Otherwise, swirl should have some more advanced modules that are sufficient in this regard.

LaTeX

Learn

Jake’s notes:

  • Okay, now that I’ve spent the last two weeks back and forth between .tex files, I feel a lot more confident. A great place to practice is Overleaf.com, where you can have the pdf rendered side by side with your tex file on the left. If you have a project, it’s best to add all of the associated pictures, figures, documents, and anything else you need uploaded ahead of time, and then write and include figures as necessary.
  • Figures in LaTeX are “floats”: if you use \begin{figure}, LaTeX moves the figure to wherever it thinks it fits best, which can be well after the text that mentions it, sometimes as far as the end of the document. If you need a graphic inline you can use \includegraphics{} on its own, but then you can’t conveniently add a caption and reference it in your document.

Style Guides:

  • There doesn’t seem to be a particularly prominent style guide, but here’s a Stack Exchange Q&A with some excellent information.

Practice & Projects

  • Pick any math, stats, or science class, or textbook and try to recreate a page of it in LaTeX, googling when necessary. You’ll learn a lot in the process. If your assignment is sufficiently complex or interdisciplinary, it may be all you need to become competent in LaTeX for your future work.
  • Make sure you include citations, and try using a .bib file.

Unix

  • Try my favorite unix command: “nc towel.blinkenlights.nl 23”

Learn

  • Any result from “learn unix” on Google will be great. I checked the first five and all have relevant info that’s exactly what I use all the time.
  • Make sure to learn regular expressions. If you can remember and be fluent in them, it will be extremely beneficial for data cleaning and manipulation. If you can’t manage to remember all the rules, it’s okay. Just understanding what the symbols mean and being able to google your problem effectively are roughly equivalent to being able to do it without assistance.
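
To show what regular expressions buy you in data cleaning, here is a small sketch using Python’s re module (the same pattern syntax largely carries over to grep and R). The messy phone strings and the clean_phone helper are invented for illustration.

```python
import re

# Hypothetical messy column: same kind of value, three different formats.
raw = ["  Phone: (510) 555-0100 ", "phone:510.555.0199", "n/a"]

# \(? and \)?   optional parentheses around the area code
# (\d{3})       capture exactly three digits
# [\s.-]*       any run of spaces, dots, or dashes between groups
PHONE = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})")


def clean_phone(text):
    """Return the number as 'XXX-XXX-XXXX', or None if no number is found."""
    match = PHONE.search(text)
    if match is None:
        return None
    return "-".join(match.groups())


cleaned = [clean_phone(s) for s in raw]
print(cleaned)
```

Even if you can’t write the pattern from memory, being able to read it symbol by symbol, as in the comments, is most of the skill.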

Practice & Projects

  • For practice, when you need to move folders or make a new folder, or search on your computer, edit documents, view documents, search in documents, etc., try to do it from the terminal without looking at any graphical interface. This will help you get more comfortable without the icons and windows helping you every step of the way. When you ssh into a remote server, you don’t have an interface anyway, so it’s good to be less dependent on the GUI.

GitHub

Section currently under construction / largely empty. Send me resources if you know of them!

Learn

Style Guides:

Practice & Projects