Databases in Production
- Foreword -
- Practices -
- Well thought out table schemas
- Production, Staging/Dev databases
- Database Change Management / Schema Migration tool
- Connection Pooling & PgBouncer
- Database Replicas
- Database APIs
- Views & Materialized Views
- Indexes
- Server Parameters
- Vacuuming, Autovacuuming & Analyze
What is this about
Common struggles around RDBMSs in production, and the things I personally would refer to every time I provision something for a prod app.
What is this not
A discussion on what the best database tech is, or relational vs non-relational databases.
Nothing too specific, a very beginner-friendly article; a basic high-level understanding of databases is enough. We'll be working with PostgreSQL because that literally is the best thing out there. Ok. I will control my biases. But yeah, also this is somewhat in a Python context.
Well thought out table schemas
Before you create the tables and/or APIs on top of them, think about what the right schemas would be for you.
Think on the lines of:
- Granularity of data
- Field data types
- Index needed
- Primary keys
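To make that checklist concrete, here's a minimal sketch. The table and column names are made up for illustration, and stdlib sqlite3 stands in for Postgres just so the snippet is runnable; in Postgres you'd pick richer types (BIGSERIAL/IDENTITY keys, TIMESTAMPTZ, NUMERIC, JSONB).

```python
import sqlite3

# Hypothetical "events" table thought through up front:
# granularity, data types, primary key, and index.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    id          INTEGER PRIMARY KEY,   -- primary key decided before launch
    user_id     INTEGER NOT NULL,      -- granularity: one row per event
    event_type  TEXT    NOT NULL,      -- field data types chosen deliberately
    created_at  TEXT    NOT NULL       -- TIMESTAMPTZ in Postgres
);
-- index the fields you'll filter and join on
CREATE INDEX idx_events_user_id ON events (user_id);
""")

cols = [row[1] for row in conn.execute("PRAGMA table_info(events)")]
print(cols)
```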
Production, Staging/Dev databases
Aah come on, this is obvious. Well, not that obvious to me when I first started using databases in production. But having separate instances for production and staging/dev is a must. You can choose to go ape shit on your database, just make sure to do it in staging.
Database Change Management / Schema Migration tool
To ensure that your database changes are consistent across environments & easy to reproduce, you'd use such a tool. It lets you make changes on test databases and then finalize them on production databases, and it's just good software practice.
Why this though? Well, what is the alternative? You running an alter command directly on the database? And perhaps one of these days having your own story of how you messed up the prod database? I speak from personal experience. Obviously.
Django does this out of the box, while there are dedicated tools like Flyway and Alembic. I use Alembic.
Alembic provides for the creation, management, and invocation of change management scripts for a relational database, using SQLAlchemy as the underlying engine. It literally autogenerates scripts for migration and is super lightweight.
I am very content with this offering since my client codebase is almost always Python & SQLAlchemy.
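The day-to-day Alembic flow looks roughly like this; the migration message and directory name are just examples, and `--autogenerate` assumes your SQLAlchemy models are wired up in `env.py`:

```shell
pip install alembic
alembic init migrations                                # scaffold alembic.ini + env.py
alembic revision --autogenerate -m "add events table"  # diff models against the DB
alembic upgrade head                                   # apply pending migrations
alembic downgrade -1                                   # roll back the last one
```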
Connection Pooling & PgBouncer
Every database query makes use of a connection to the database. Opening and closing connections creates overhead, which can degrade database performance.
Using a connection pool can help reduce the overhead of opening and closing database connections by sharing a pool of open connections among all clients. When a client wants to perform a CRUD operation, it can borrow a connection from the pool and return it when it's finished. This way, the same connection can be used over and over again instead of creating a new connection each time.
Pooling not only helps you avoid the connection handling overhead and improve performance, it also helps absorb traffic spikes by capping and queueing connections instead of overwhelming the database.
Connection pooling can be done at the client end (SQLAlchemy provides it out of the box if you are using Python) or as an external service - PgBouncer.
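The borrow/return idea is simple enough to sketch with the stdlib. This is a toy pool (sqlite3 standing in for Postgres, pool size of one just for the demo); in real life SQLAlchemy's built-in pooling or PgBouncer does this for you, with timeouts, overflow, and recycling on top.

```python
import queue
import sqlite3

class Pool:
    """Toy connection pool: open connections once, hand them out, take them back."""

    def __init__(self, size: int):
        self._conns = queue.Queue()
        for _ in range(size):
            # pay the connection cost once, up front
            self._conns.put(sqlite3.connect(":memory:", check_same_thread=False))

    def borrow(self):
        return self._conns.get()      # blocks if every connection is busy

    def give_back(self, conn):
        self._conns.put(conn)         # the connection is reused, never closed

pool = Pool(size=1)
c1 = pool.borrow()
result = c1.execute("SELECT 1").fetchone()[0]
pool.give_back(c1)
c2 = pool.borrow()                    # same underlying connection, reused
```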
Also, to put it bluntly - irrespective of what you are using to connect to the database, at some point in your software lifecycle you might max out the possible connections to the database (either idle or active) and be unable to make new ones. PgBouncer cleans up the connection mess.
Basic monitoring & alerts on top of Memory, CPU, Storage & Connections usually help avoid potential issues. Sometimes all you need to do to understand why your app is slow is simply look at the database metrics. And oftentimes, if you are lucky, it will be the database.
pg_stat_activity is a view provided by PostgreSQL for debugging, analysing & troubleshooting your active queries. It contains the following information (and more) -
- pid - the backend process' id
- usename - user who ran the query
- query - the query itself
- query_start - when the query started executing
- state - current state of the backend (active, idle, idle in transaction, ...)
Here's an excellent gist on GitHub that helps navigate pg_stat_activity: legendary gist on query handling in psql.
While one could debate the use of pg_stat_activity for actual diagnostics, I often use this view to clean up the active and idle connections in the database by terminating their processes. Sometimes queries don't run. And they are stuck. And oftentimes we might not care about them and simply want to free up the memory / CPU / connections so that other, important queries get handled.
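These are the two statements I reach for, wrapped in Python strings so the snippet is self-contained; the 5-minute cutoff is an arbitrary example, and `pg_terminate_backend()` is the standard Postgres function for killing a backend process.

```python
# Find connections that have been running (or stuck) for a while.
FIND_STUCK = """
SELECT pid, usename, state, query_start, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '5 minutes'
ORDER BY query_start;
"""

def terminate_backend_sql(pid: int) -> str:
    # pg_terminate_backend() kills the backend process and frees its connection
    return f"SELECT pg_terminate_backend({pid});"

print(terminate_backend_sql(4242))
```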
Database Replicas
Storing copies of your database on different machines. It helps with the following points (if not more):
- Faster - Reads can be closer to where the end user is. Low latencies.
- Durable - Backups
While there are cons as well - managing replicas, extra machines, and dealing with replication lag.
If you are using managed offerings like Azure Database for Postgres, management is taken care out of the box.
An anecdote from work - for a fucking week, we could not figure out why our application had high timeouts and thus failures. We knew it was related to the database; that much was evident from our monitoring & tracing setup. But we never got around to understanding why, at certain times, queries just didn't finish within an acceptable time (~1s, when the same query ran within 90ms most times). Obviously we thought something was wrong with our database provider, Azure, and reached out to them. After a week of back and forth with their database & networking teams, we realized our issue was simply that we were making a high number of calls over the public network.
The issue was - our app, in a K8s cluster on a different network, was making calls to the database on another network over the public internet. When we forced these calls through a private network via VNet peering, it worked.
Database APIs
When your database is used by multiple services or codebases, it becomes quite difficult to track its usage & the effect each one might have on the DB's performance. One such service/app could impact the performance of another by running heavy queries that hog CPU, etc.
In such cases, it often becomes desirable to have a single point of contact for the DB: a single (API) layer on top of your database that forces every app that wants to query the database to go via this API layer.
Although it is debatable whether such an architecture does more harm than good, it has real pros. Authentication, schema management, and consistency can all be well within control if there's only one thing you have to actively maintain.
An example of such a use case - let's say there's a table in a database that's constantly undergoing schema changes. You have multiple teams, with multiple apps under each team, connecting to this database and updating or adding its contents. What ends up happening is codebases from different teams, with different context and at times different languages & frameworks, all making changes to the database.
- How do you handle schema changes that were done on the db for a different app?
- How do you yourself communicate schema changes you want to introduce for your service to the db?
- How do you track what is choking the DB? And how the DB might be (ab)used?
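In-process, the same idea is just a single data-access module that every service imports instead of talking to the DB directly; over the network it becomes an HTTP/gRPC service. A minimal sketch, with made-up table and function names and sqlite3 standing in for Postgres:

```python
import sqlite3

# The ONE place that knows the schema. Apps call these functions;
# nobody else writes SQL against the database.
_conn = sqlite3.connect(":memory:")
_conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

def add_user(name: str) -> int:
    # single spot to enforce validation, auth, and schema knowledge
    cur = _conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    return cur.lastrowid

def get_user(user_id: int):
    row = _conn.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return {"id": row[0], "name": row[1]} if row else None

uid = add_user("asha")
```

When the schema changes, only this module changes; callers keep the same function signatures.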
Views & Materialized Views
Views are a logical copy of your tables, constructed or stitched together via joins etc. - essentially a saved select query. The results are fetched fresh from the underlying tables every time you query the view. In other words, querying a view is as good as running a SQL query where you don't need to worry about the exact nature of the query.
Materialized views, on the other hand, actually contain the data.
There are some crucial differences between them
- Views - SQL queries, cost effective, fresh data every time you query
- Materialized Views - Actual data, storage cost, data needs to be refreshed, faster than querying a typical view
I was in college when I had first come across materialized views. After 4 years (2 years of office included) I realized its true potential.
Both views & materialized views have their own uses, and the differences between them help guide which one fits which use case.
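Here's the "always fresh" property in action. SQLite (stdlib, so this runs) supports plain views but not materialized ones, so the materialized-view DDL is shown as Postgres SQL strings; the table and view names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
INSERT INTO orders (amount) VALUES (10.0), (15.0);
CREATE VIEW order_totals AS
    SELECT count(*) AS n, sum(amount) AS total FROM orders;
""")
n, total = conn.execute("SELECT n, total FROM order_totals").fetchone()

# a view re-runs its query: add a row and the numbers change immediately
conn.execute("INSERT INTO orders (amount) VALUES (5.0)")
n2, total2 = conn.execute("SELECT n, total FROM order_totals").fetchone()

# Postgres materialized-view equivalents (data is stored, so it goes
# stale until you refresh it explicitly):
MATVIEW = ("CREATE MATERIALIZED VIEW order_totals_mv AS "
           "SELECT count(*) AS n, sum(amount) AS total FROM orders;")
REFRESH = "REFRESH MATERIALIZED VIEW order_totals_mv;"
```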
Indexes
Well, put an index on your fields for peak performance. You need to query the DB on certain fields? Put an index. You make joins on certain fields in your tables? Put an index. Want to do full-text search on your text fields? Put an index.
Although FYI - too many indices come with cons: every record insertion also has to update the indices, which makes writes slower.
Just don't go overboard, ok?
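You can watch an index change the query plan. This uses SQLite's EXPLAIN QUERY PLAN so it runs from the stdlib (in Postgres you'd use EXPLAIN / EXPLAIN ANALYZE); the table and index names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

# without an index: full table scan
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchall()

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# with an index: an index search instead
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@b.c",)
).fetchall()

print(plan_before[-1][-1])   # e.g. "SCAN users"
print(plan_after[-1][-1])    # mentions idx_users_email
```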
Server Parameters
Postgres servers come with a shitload of levers for peak performance and desirable behaviour.
If you are using something managed (Azure Database for Postgres), chances are the defaults are already optimized for you.
Server parameters let you change the behaviour of queries, logging, vacuuming, connections, and more. Use them when required.
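Inspecting and changing parameters is plain SQL; the `work_mem` value below is an arbitrary example, not a recommendation:

```sql
-- peek at current settings
SHOW shared_buffers;
SHOW max_connections;

-- persisted to postgresql.auto.conf; takes effect after a reload
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();
```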
Vacuuming, Autovacuuming & Analyze
I'll be honest. Out of all the above tricks and practices, vacuuming is the one I came to know of last. Vacuuming helps you clean your database. That is it. You reclaim storage space by physically deleting obsolete data (dead rows).
By default, autovacuuming might be enabled for you; in that case you won't need to think about this actively.
ANALYZE is another such command: it collects statistics about your tables so the planner can produce faster query plans.
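Handily, SQLite also has VACUUM and ANALYZE, so the idea can be shown with the stdlib; in Postgres, VACUUM specifically reclaims dead tuples left behind by MVCC and ANALYZE refreshes planner statistics (autovacuum usually runs both for you).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, msg TEXT)")
conn.executemany("INSERT INTO logs (msg) VALUES (?)", [("x",)] * 1000)

# deleting rows leaves dead space behind in the file
conn.execute("DELETE FROM logs WHERE id <= 900")
conn.commit()

conn.execute("VACUUM")    # rebuilds storage, reclaiming the freed space
conn.execute("ANALYZE")   # gathers stats for the query planner

remaining = conn.execute("SELECT count(*) FROM logs").fetchone()[0]
```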
Here's a couple of articles on vacuuming that I found good: https://linuxhint.com/postgres-vacuum/, https://severalnines.com/blog/overview-vacuum-processing-postgresql/
PS - I would like to call out the people I have frequently relied on for databases - @abhijith.c, @mike.mccarthy, @vinay.khetan, @shreyas.ranganatha, @shivjeet.bhosale
Abhijith & I did drop a production table once.
Follow me on Twitter for my latest thoughts on things in a typical Larry David style format