Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Below are the top discussions from Reddit that mention this Amazon book.
Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords? In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.
- Peer under the hood of the systems you already use, and learn how to use and operate them more effectively
- Make informed decisions by identifying the strengths and weaknesses of different tools
- Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity
- Understand the distributed systems research upon which modern databases are built
- Peek behind the scenes of major online services, and learn from their architectures
Reddazon may receive an affiliate commission if you make purchases on Amazon.com through this site. Thank you for using these links to support Reddazon.
Martin Kleppmann
Reddit Posts and Comments
0 posts • 37 mentions • top 36 shown below
2 points • wgljr
SQL scales perfectly fine. Pick up and read through Designing Data-Intensive Applications by Martin Kleppmann. He answers your question within the first few chapters. It's a really detailed and thorough book that gets deep into databases, scalability, and how to organize a distributed system that can handle large amounts of traffic.
2 points • elus
Anyone interested in the above should read Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann.
2 points • MrFourSeasons
Right now I'm supplementing my university courses with DataCamp's DE path and this book, and I'm considering doing the DE ND on Udacity. Are there any resources you'd recommend?
8 points • ds_neu_throwaway
Senior data science major so I've taken all of these classes lmao let's gooooo
CS4100: Summer 1 2018 with Kevin Gold, the legend himself. It's a pretty good place to start for most CS majors since you're only expected to know Java. Much of the course/assignment was taught around games, which is a pretty approachable place to start for many of the course's big ideas. The other side of that coin is that because it doesn't expect much in terms of mathematical background you really only scratch the surface of the material. This also means there's not a ton of overlap with data science proper since it requires heavy statistics. A lot of people seemed to ignore it but my favorite part of this course was learning about the history of AI. It comes up surprisingly often and if you find yourself talking to researchers or reading papers it's stuff they generally assume you know.
Six homeworks total, each a programming assignment in Java: A*, minimax, Bayesian inference, decision trees, Q-learning, expectation maximization. They're all pretty straightforward, and I remember doing each of them the night they were due no problem.
TL;DR: easy elective and a good starting point if you don't know a lot about the whole computers that learn thing.
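For anyone who hasn't met those assignments before, here is a minimal sketch of tabular Q-learning on a made-up four-state chain. None of this is from the course; the states, rewards, and hyperparameters are purely illustrative.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 4, 3            # states 0..3; reward only for reaching state 3
ACTIONS = [-1, +1]               # step left or step right
alpha, gamma, epsilon = 0.5, 0.9, 0.2

Q = defaultdict(float)           # Q[(state, action)] -> estimated return

for _ in range(2000):            # episodes
    s = 0
    for _ in range(50):          # cap episode length so the loop always ends
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s', a')
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next
        if s == GOAL:
            break

print({k: round(v, 2) for k, v in sorted(Q.items())})
```

After training, the learned values along the rightward path toward the goal should be the largest, which is the whole point of the update rule.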
CS4120: Spring 2020 with Lu Wang. She's a very good professor: does cool research in this area, expects a lot out of her students but very kind and helpful. The key thing to know is that NLP is a very big, very advanced field that changes rapidly with each passing year. It helps to have some understanding of what state-of-the-art techniques can accomplish to motivate the fundamental material. This course is also typically taught to incoming grad students more often than it is to undergrads. So while the prof understands that you may not have as strong a mathematical background, you might be playing catch-up as the semester goes on.
This course only covers the text part of NLP; if you're looking to learn about speech recognition or audio processing you won't find it here. Quite a bit of ground gets covered in spite of that: basic language modeling with n-grams, tokenization, part-of-speech tagging with Hidden Markov models, constituency and dependency parsing, training word embeddings with CBOW a la word2vec, information retrieval, summarization, and machine translation. It's a lot more material than most courses would cover, but that's to be expected. The goal is to gain enough foundational knowledge to engage with current NLP research, which is no small feat.
Four programming homeworks, each of which is straightforward but takes a substantial amount of effort. There's also a final project done in groups of two or three, meant to showcase research-type work on a problem using the above methods. I think my project was graded easier due to the pandemic, so idk how different the expectations are in a normal semester, but it was still a lot of work.
TL;DR: pretty tough class that tracks closely with a lot of near-current or current research, so it promises a lucrative skillset. Probably not good for someone just looking for an elective unless you're particularly motivated.
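To give a feel for the most basic end of that syllabus, here is a minimal sketch of bigram language modeling with add-one smoothing over a made-up toy corpus. Nothing here is taken from the course materials.

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)

def bigram_prob(prev: str, word: str) -> float:
    # P(word | prev) with add-one (Laplace) smoothing
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("the", "cat"))   # seen bigram: relatively high
print(bigram_prob("the", "dog"))
print(bigram_prob("cat", "rug"))   # unseen bigram: small but nonzero
```

Real language models are vastly bigger, but the count-and-smooth idea is the starting point the course builds on before moving to embeddings and neural approaches.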
DS4200: Spring 2019 with David Sprague. (see comments below, he's no longer at NEU so YMMV) This is an in-depth class in a niche subject. Visualization is critical to data mining and pays big bucks, yet still gets ignored among data scientists because it falls closer to UI/UX than it does to shiny stuff like deep learning. I was surprised at how cool this class was because it's this whole huge area of research that you wouldn't even realize existed: there's so much more to visualization than meets the eye. You learn all about human cognition and perception, from abstract things like marks and channels all the way down to how the photo-receptors in the eye work. A lot about color encodings and color maps: using color appropriately is kind of tricky and almost nobody in the real world gets it right. Lots about interactivity: brushing and linking, panning, zooming, drag and drop. User testing comes up frequently, and how the tool has to be grounded in the questions being asked about the data. Edward Tufte and Michelle Borkin came up a lot, along with discussions of ethics and honesty.
My biggest gripe about this class was the assignments: with the exception of a little bit of Tableau, you're expected to learn and use D3 for the class. In and of itself that isn't so bad, except that a) you may not already be comfortable with JavaScript for the browser/DOM, and b) the professor may not know enough web dev or D3 to teach you. Sprague knew so much about the theory of building effective visualizations, but when it came to actually making things appear in the browser it was the blind leading the blind. When the final project turned out to be building a visualization web app from scratch in D3, I remember my group all feeling a little bit doomed because he couldn't/didn't do enough to bring the class up to speed. Hopefully that's no longer the case, but if it is, I recommend either this very short book or taking a web dev class first.
TL;DR: not a difficult class so long as you have a grasp on JavaScript/D3 or the right resources to learn it. Probably the most interesting and broadly applicable of all.
DS4300: Spring 2020 with John Rachlin. I can't say enough good things about this man: he's very good at teaching, very fair with his classes, and loves encouraging students to take on cool projects. The content itself is a little bit weird though: the way it was taught was almost like a survey of big data engineering technologies. Like, a little bit of distributed SQL, MongoDB, Redis, Scala, Spark, Neo4J and I can't even remember what else got thrown in. You don't learn enough about any one of them to put it on your resume really, but it's a good starting point and it helps to know what's around. The actual theory of distributed data systems comes from a book that amounts to tech Gospel, but its presentation in the course is very condensed. So if you're trying to ace your system design interview you're still gonna have some work to do on your own.
Homeworks are centered around a specific technology, and mostly amount to getting up and running for a simple task. The final project is building a codebase to solve a problem with some dataset that would benefit from NoSQL or alternative modern data technology. Again, pandemic, but I didn't think it was so bad and my group did well.
TL;DR: a nice survey of current big data technologies, but not enough depth to make you a competitive data engineer without further study.
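As an idea of what those "get up and running" homeworks look like, here is a minimal sketch using the redis-py client, assuming a Redis server is running locally on the default port; the keys and values are made up.

```python
import redis

# Connect to a local Redis server (assumed to be running on localhost:6379).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("user:42:name", "ada")        # simple key/value write
r.expire("user:42:name", 3600)      # a TTL makes it behave like a cache entry
print(r.get("user:42:name"))        # -> "ada"

# A toy leaderboard using a sorted set, another common intro exercise.
r.zadd("scores", {"alice": 120, "bob": 95})
print(r.zrevrange("scores", 0, 1, withscores=True))
```

Each technology in the survey tends to get one small exercise like this, which is enough to see what the tool is for but, as noted above, not enough depth for a resume bullet on its own.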
DS4400: Spring 2020 with Ehsan Elhamifar. Now I had a friend who took it with someone else and according to him it was very different. But boy oh boy if you thought Fundies or OOD was tough, this class will put hairs on your chest for sure. He is a hard-ass professor and his presentation of the material is very rigorous. I mean this in the best way possible too: at my first co-op I was very unprepared for the level of mathematical conversation happening between data scientists on my team. I feel like taking this course with Prof. Elhamifar forced me to demonstrate the mathematical maturity I needed to finally feel confident I could "play ball with the big kids" so to speak. I wouldn't recommend taking this unless you're very comfortable with the preliminary math: specifically Prob and Stats, Stats and Stochastic, Linear Algebra, and maybe even Calc 3 though it's not strictly necessary. This is not only because it's tough to teach yourself all that material well enough to keep up, but because the written exams in this class resemble those of college math classes.
Covers matrix calculus, the least-squares objective, linear regression, regularization, logistic regression, support vector machines, maximum a posteriori estimation, and basis function expansions as a starting point for neural networks. Four homeworks total, each of which consists of a written problem set and a coding portion. The coding part is always pretty straightforward; the math problems are proof-based and challenging even if you understand the lectures. The final project is an application of the algorithms in the class to some dataset (probably from Kaggle) so there's not too much to worry about there.
TL;DR: very much a tough class but crucial for more advanced machine learning. Probably not good for an elective unless you are very motivated. Definitely rule it out if you're not good at math.
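For a taste of the least-squares and regularization material named above, here is a minimal NumPy sketch comparing ordinary least squares with ridge regression via the normal equations. The synthetic data and the lambda value are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# OLS: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + lambda * I)^{-1} X^T y  -- the L2 penalty shrinks the weights
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", w_ols)
print("Ridge:", w_ridge)
```

The written exams in a class like this ask you to derive these solutions and reason about them, which is where the proof-based math background really matters.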
1 points • curryeater259
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
1 points • whymauri
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
save yourself the money, time, and grief.
1 points • wololo94
Read books. Start with Designing Data Intensive Applications.
1 points • trabbaro
Strong recommendation: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/
Design is all about scaling. And the big problem with scaling is scaling the data. That book is a great look at how that happens.
1 points • milkeater
Your comment of "you and the rest of the industry would really like to know" says otherwise.
You didn't come here to have a conversation, just to pump your own tires. Whether you believe it or not, you are speaking buzzwords and I can see right through it.
If you genuinely are struggling, read that system primer and practice with this book
You aren't the only one struggling in this area, but that doesn't mean everyone is. Frankly, there is a vendor model of providing lift-and-shift services in bulk to large enterprises, with experiments in the $100Ks. It's one of the easiest and hottest paths I've seen for getting into FAANG or starting your own service.
1 points • NowImAllSet
I don't usually recommend resources that I haven't personally read, but I've heard lots of great things about Designing Data-Intensive Applications by Martin Kleppmann.
1 points • nivenkos
I'd add reading Designing Data-Intensive Applications.
I work as a data engineer at a FAANG. We do the equivalent of a LeetCode easy to check for basic programming ability (consideration of computational complexity and trade-offs between implementations), but the main focus is on familiarity with SQL and database trade-offs.
E.g.:
- Can you explain the CAP theorem?
- What are ACID transactions?
- How would you analyse query performance?
- What might you change in the database structure to allow for more performant queries?
- How would you check that this is working?
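A minimal, self-contained sketch of the query-tuning questions above, using Python's built-in sqlite3 (the table and column names are made up): inspect the plan, add an index on the filtered column, then inspect the plan again to check that it's working.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 1.5) for i in range(10_000)],
)

query = "SELECT SUM(total) FROM orders WHERE customer_id = ?"

# Before: the plan reports a full scan of the orders table.
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())

# Structural change for more performant reads: an index on the filter column.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# After: the plan shows a search using idx_orders_customer instead of a scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
```

The same before/after discipline applies in Postgres or MySQL with EXPLAIN/EXPLAIN ANALYZE, which is usually what these interview questions are probing.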
Also non-technical questions about past experience form a large part of the interview process. So think carefully about projects you have worked on:
- What worked well?
- What didn't work out?
- What have you learnt to do differently?
- When have you pushed for some new improvement or idea and seen it through to delivery?
- When have you faced push-back against your ideas? How did you react?
- When have you been blocked by other teams? How did you resolve this?
1 points • iGoByDuBz
Highly recommend picking this up and working through it: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/ref=cm_sw_r_cp_api_i_KiknFbMM0SAHS
1 points • sunny_tomato_farm
Read the bible: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/ref=cm_sw_r_cp_api_i_e2RTEbKJD4CYN
1 points • hagy
Thanks for sharing. It looks interesting, so I read through the free sample and, finding it enlightening, purchased the book. I'm enjoying reading through it and imagine it will be quite educational.
While many are aware of the book, I'd like to also recommend the new classic, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. I've read through the book multiple times and learn something each time.
1 points • i_wanna_get_better
For distributed systems in general, you can’t go wrong with Designing Data-Intensive Applications
1 points • NakkiGN
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems https://www.amazon.com/dp/1449373321/ref=cm_sw_r_other_apa_i_qyNKEbT8HZKFJ
1 points • FuncDataEng
I would also highly suggest learning FP. I am a Senior Data Engineer with Amazon and this was the most valuable thing I taught myself when I started working at Amazon. Along with all these suggestions, the book I recommend to every new data engineer is https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321 as it gives you a great starting point for going more in depth on the various architectures you will encounter. The other thing I would say is don't just learn SQL syntax; focus on how database internals work in order to truly understand how to apply optimization skills. You will often be handed some query or crude pipeline from Software Engineers or Data Scientists and will need to be able to optimize it to be production ready.
1 points • ccleary00
Designing Data-Intensive Applications is a really good one
1 points • GiorgioPerlasca
In theory yes "keep data near where it is used the most". In practice you have to consider partial failures, network failures, and so on.
To understand it better, there is a great book: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
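A minimal sketch of the "partial failures" point: even with data placed close to its readers, a remote call can time out, so callers end up layering timeouts, retries, and backoff on top. Everything here (fetch_from_replica, REPLICAS, the failure rate) is hypothetical.

```python
import random
import time

REPLICAS = ["replica-a", "replica-b", "replica-c"]

def fetch_from_replica(replica: str, key: str) -> str:
    # Stand-in for a network read that sometimes fails or times out.
    if random.random() < 0.3:
        raise TimeoutError(f"{replica} did not answer in time")
    return f"value-of-{key}"

def read_with_retries(key: str, attempts: int = 3) -> str:
    delay = 0.1
    for i in range(attempts):
        replica = REPLICAS[i % len(REPLICAS)]   # try a different replica each time
        try:
            return fetch_from_replica(replica, key)
        except TimeoutError:
            time.sleep(delay)                   # back off before retrying
            delay *= 2
    # Even retries can be exhausted; callers still have to handle this case.
    raise RuntimeError(f"all {attempts} attempts for {key!r} failed")

print(read_with_retries("user:42"))
```

The book spends several chapters on exactly this gap between the tidy theory ("keep data local") and what failure-prone networks do to it.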
1 points • realfeeder
I love the book you mentioned (assuming it is this one), but I really don't think this is a good starting point for OP because, quoting:
> but many topics will be new to me, learning python, Apache products etc.
They said they are learning Python and you just suggested a very comprehensive and advanced book on the topic. I think they would benefit way more from writing and breaking stuff at the beginning of their journey.
And that website is piracy. That is not allowed here. You can read the book legally, for free, on the Safari Books Online trial if you wish.
1 points • exlaximas
Coding languages: Python, Java, Scala
Querying language: SQL
Read up on MapReduce and Spark. Maybe create a Jupyter Notebook to mess around with Spark.
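If you want something concrete to try in that notebook, here is a minimal local-mode sketch of the classic MapReduce-style word count in PySpark. It assumes pyspark is installed; the data.txt path is a placeholder you would point at any text file.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

counts = (
    spark.sparkContext.textFile("data.txt")    # read the file as lines
    .flatMap(lambda line: line.split())        # "map": emit one record per word
    .map(lambda word: (word, 1))               # key each word with a count of 1
    .reduceByKey(lambda a, b: a + b)           # "reduce": sum the counts per word
)

print(counts.take(10))
spark.stop()
```

It's a toy, but the flatMap/map/reduceByKey shape is the same pattern the MapReduce papers and the batch-processing chapters of the book describe at scale.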
I was recommended to read Designing Data Intensive Applications by some people in this sub and a data engineer at LI.
That should be a good starting point. Getting hands-on experience with Big Data might be a bit tough if you're doing this on your own, because spinning up multiple machines can get quite expensive.
From my perspective, I think the best way for someone to learn big data is to get a data engineering job/internship.
Congrats on finishing up university and I wish you the best.
1 points • kobvel
>https://www.amazon.com/dp/1449373321/ref=cm_sw_r_other_apa_i_qyNKEbT8HZKFJ
I would recommend this book not just for big data but in general, as a chest of knowledge for any software engineer.
1 points • bradengroom
Designing Data Intensive Applications if you want something that covers a variety of data systems.
SQL Performance Explained is a great book if you want something more specific to understanding B-Tree indexes in traditional relational databases.
1 points • morpho4444
not a 2020 book but I read it in 2020
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
opened my mind about going beyond data warehouses.
1 points • healydorf
Not really. Passive listening is great if you want little "a ha" moments, but not if you need to build a solid understanding of a topic.
That aside, I don't know of any "fundamentals" podcasts. There are some that do really great deep dives into more specialized topics. Coding Blocks did a really great "listen along" with Designing Data-Intensive Applications.
1 points • simplescalar
A friend of mine who is a software architect for large startups recommended these books
1 points • confusedtaco
Let's all pitch in and buy RH engineering a bunch of copies of https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
Very stupid, very legal.
1 points • maybedota
My 2 cents: You can start by learning Spark; personally I think it's a great framework for learning how distributed data processing / streaming works.
Secondly, I recommend this book even if you don't have an interest in the field: https://www.amazon.com/Martin-Kleppmann/dp/1449373321/ref=sr_1_1?crid=1XYWI3UFVEW21&dchild=1&keywords=data+intensive+applications&qid=1603452406&sprefix=Data+inten%2Caps%2C212&sr=8-1
Thirdly, don't set your goal to be "great", but rather to be "better".
1 points • luthfurc
Check out Martin Kleppmann's Designing Data Intensive Applications: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
2 points • tdstdstds
IMO, one of them is this one: Designing Data-Intensive Applications
And as a reference or long term goal, these two:
- https://ce.guilan.ac.ir/images/other/soft/distribdystems.pdf
edit: added long term goals / references
2 points • AddMoreAbstraction
Not OP (they may have a different take), but I've gotten the most out of books that cover broad engineering topics, rather than some specific language/framework/etc. Working Effectively with Legacy Code is where I would recommend starting (it's a language agnostic deep dive on how and why we test). Anything written by Martin Fowler is amazing (he gives great talks, too). Designing Data-Intensive Applications is a book I wish I'd found years ago.
A quick Google for 'best programming books' will point you to plenty of discussions. It's hard to go wrong by grabbing the intersections of those lists.
1 points • rantwasp
read: http://highscalability.com/ often
read: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
read: https://www.amazon.com/Introduction-General-Systems-Thinking-Anniversary/dp/0932633498
read the books in the AOSA series: https://www.aosabook.org/en/index.html
look into the AWS Well-Architected Framework: https://aws.amazon.com/blogs/apn/the-5-pillars-of-the-aws-well-architected-framework/
actually build stuff and hang around with people that build stuff. ask them why they do things in a certain way - be humble