About the Open Edition
The 3rd edition of Python for Data Analysis is now available as an “Open Access” HTML version on this site https://wesmckinney.com/book in addition to the usual print and e-book formats. This edition was initially published in August 2022 and will have errata fixed periodically over the coming months and years. If you encounter any errata, please report them here.
In general, the content from this website may not be copied or reproduced. The code examples are MIT-licensed and can be found on GitHub or Gitee along with the supporting datasets.
If you find the online edition of the book useful, please consider ordering a paper copy or a DRM-free eBook (in PDF and EPUB formats) to support the author.
This web version of the book was created with the Quarto publishing system.
What’s New in the 3rd Edition?
The book has been updated for pandas 1.4.0 and Python 3.10. The changes between the 2nd and 3rd editions are focused on bringing the content up-to-date with changes in pandas since 2017.
Update History
This website will be updated periodically as new early release content becomes available, and post-publication for errata fixes.
- April 12, 2023: Update to pandas 2.0.0 and fix some code examples.
- October 19, 2022: Fix a table link and add eBooks.com links.
- September 20, 2022: Website update after final publication including a couple of minor errata fixes.
- July 22, 2022: Incorporate copy-editing and other improvements for “QC1” stage of production en route to publication in print later this summer.
- May 18, 2022: Update open access edition with all chapters. Include edits from technical review feedback (thank you!), acknowledgements for the third edition, and other preparation to make the book ready for production on its way to print later in 2022.
- February 13, 2022: Update open access edition with chapters 7 through 10.
- January 23, 2022: First open access edition with chapters 1 through 6.
Materials and IPython notebooks for «Python for Data Analysis» by Wes McKinney, published by O’Reilly Media
KAUST-Academy/python-for-data-analysis
Course materials for a multi-day course on data analysis with Python using Pandas, based on materials from «Python for Data Analysis, 3rd Edition» by Wes McKinney, published by O’Reilly Media. Book content, including updates and errata fixes, can be found for free on the author’s website, and the book is available for sale on Amazon.
The objective of this course is to give students hands-on, practical experience with data analysis using the Python programming language. The course introduces state-of-the-art data analysis tools that are widely used in industry.
This course will cover the majority of Python for Data Analysis by Wes McKinney. On completion of this course, students should be able to:
- Recognize and select data types used in Python for data analysis;
- Understand how to prepare data for further analysis using Pandas, Matplotlib, and Seaborn libraries;
- Understand and apply data modelling and analysis workflows in Python;
- Apply Python for real-world data analysis problems.
A whirlwind tutorial of the basics of the Python programming language. The module also covers enough IPython and Jupyter to make learners comfortable with the programming environment before tackling the more advanced material presented in later modules. This material should be shared with students before the start of the course so that they can review it.
Tutorial | Open in Google Colab | Open in Kaggle |
---|---|---|
Python Language Basics | ||
Built-in Data Structures, Functions, and Files |
After completing this module, learners should understand the main data types used for data analysis in Python, such as NumPy arrays, Pandas Series, and Pandas DataFrames. Learners should also be able to read data from, and write data to, storage in various formats using Pandas.
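As a rough illustration of these data types (not taken from the course notebooks), the sketch below builds a NumPy array, a Series, and a DataFrame from made-up values, then round-trips the table through CSV. The column names and numbers are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np
import pandas as pd

# A NumPy array: a fixed-type, n-dimensional block of values.
arr = np.array([1.5, 2.0, 3.5])

# A pandas Series: a one-dimensional labeled array built on top of NumPy.
prices = pd.Series(arr, index=["a", "b", "c"], name="price")

# A pandas DataFrame: a table whose columns share one index.
# The "qty" values are made up for illustration.
df = pd.DataFrame({"price": prices, "qty": [10, 20, 30]})

# Write the table as CSV text and read it back. The StringIO buffer
# replaces a real file path so the sketch has no side effects.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
```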
Tutorial | Open in Google Colab | Open in Kaggle |
---|---|---|
NumPy Basics | ||
Advanced NumPy (optional) | ||
Pandas Basics | ||
Data Loading, Storage, and File Formats |
After completing this module, learners should understand how to prepare (i.e., clean, manipulate, aggregate, and visualize) data for further analysis. Learners will develop a working knowledge of the Pandas API as well as a basic knowledge of plotting and visualizing data with Matplotlib and Seaborn.
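A minimal sketch of the clean, manipulate, and aggregate steps is shown below. It is an assumption-laden example rather than course material: the `region` and `sales` columns and all of their values are invented, and the plotting step is left out to keep the example free of display dependencies.

```python
import numpy as np
import pandas as pd

# Made-up raw data with inconsistent labels and missing values.
raw = pd.DataFrame({
    "region": ["north", " North ", "south", "south", None],
    "sales": [100.0, np.nan, 250.0, 150.0, 75.0],
})

# Clean: drop rows without a region, then normalize the labels.
clean = raw.dropna(subset=["region"]).copy()
clean["region"] = clean["region"].str.strip().str.lower()

# Manipulate: fill the missing sales figure with the column median.
clean["sales"] = clean["sales"].fillna(clean["sales"].median())

# Aggregate: total sales per region.
totals = clean.groupby("region")["sales"].sum()
```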
Tutorial | Open in Google Colab | Open in Kaggle |
---|---|---|
Data Cleaning and Preparation | ||
Data Wrangling | ||
Plotting and Visualization | ||
Data Aggregation and Group Operations | ||
Time Series |
After completing this module, learners will understand how to develop basic data modelling and analysis pipelines using Patsy, Statsmodels, and Scikit-Learn.
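For orientation, the sketch below shows what the smallest possible Scikit-Learn modelling step might look like. It is not drawn from the course notebooks: the data are synthetic and follow y = 2x + 1 exactly, so the fitted coefficients are known in advance. Patsy and Statsmodels are not shown here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 2x + 1 exactly (an assumption
# made purely so the fitted values are easy to check).
X = np.arange(10, dtype=float).reshape(-1, 1)  # ten samples, one feature
y = 2.0 * X.ravel() + 1.0

# Fit an ordinary least squares model and predict at a new point.
model = LinearRegression().fit(X, y)
prediction = model.predict(np.array([[10.0]]))
```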
Tutorial | Open in Google Colab | Open in Kaggle |
---|---|---|
Defining Data Models using Patsy | ||
Statistics Approach to Data Modeling with Statsmodels | ||
Machine Learning Approach to Data Modeling with Scikit-Learn |
Finally, learners will also have an opportunity to apply the skills that they have learned to analyze real data. Typically, instructors should select 3 of the following projects to cover over one day of instruction.
Tutorial | Open in Google Colab | Open in Kaggle |
---|---|---|
Data Analysis Example: Bitly Data from USA.gov | ||
Data Analysis Example: MovieLens 1M | ||
Data Analysis Example: US Baby Names | ||
Data Analysis Example: USDA Food Database | ||
Data Analysis Example: 2012 Federal Election Commission |
To get the most out of this material, learners should have completed Python Crash Course before attempting this course (but this is not a strict prerequisite).
Instructors have a few options for teaching the material.
1. Have the book open on an iPad (or similar); have the students open a new blank notebook; live code some (or all) of the examples from the book and use the text of the book as speaking notes.
2. Have the students open the book in their browser; have students open a blank notebook in another browser window and then have them read through the relevant chapters of the book and code up the examples. The lead instructor and any teaching assistants are available to troubleshoot and answer individual questions. Common questions should be answered to the group as a live demo.
3. Have the students open the book in their browser; have students open the provided notebooks in another browser window and then have them read through the relevant chapters of the book and execute the provided code. The lead instructor and any teaching assistants are available to troubleshoot and answer individual questions. Common questions should be answered to the group as a live demo.
4. Some combination of the above.
Option 1 is the most difficult for the lead instructor but likely the most engaging for learners; Option 3 is the easiest for both the lead instructor and the students but likely results in the least learning. Option 2 is a middle ground: easier for the lead instructor but still requires students to write their own code.
The code in this repository, including all code samples in the notebooks listed above, is released under the MIT license. Read more at the Open Source Initiative.