Database Fundamentals

Table of Contents

Following topics will be covered:

  • Relational vs. Non-relational databases
  • ACID properties and transactions
  • Indexing strategies and query optimization
  • Connection pooling and management
  • Basic caching with in-memory stores

Important This will be a high level overview of these systems, and covering the essential parts.

Briefing;

In the previous topic, we touched on relational databases and SQL. However, not all data fits neatly into rows and columns. As systems scale or requirements change, developers often look toward NoSQL (Not Only SQL) databases.

Relational databases, such as PostgreSQL and MySQL, organize data into tables with strict schemas. They enforce relationships using foreign keys. Relational databases are ideal when your data structure is predictable and relationships between entities are complex, like a user having many orders that each have many products. They are also well-suited for applications where data integrity is essential, like financial systems or inventory management.

NoSQL databases, on the other hand, are built for flexibility and scale. They generally fall into four categories:

  • Document stores (MongoDB): Store data in JSON-like documents. Great for rapid development where schemas evolve quickly.
  • Key-Value stores (Redis, DynamoDB): Extremely fast, simple pairs. Ideal for session management or shopping carts.
  • Wide-column stores (Cassandra): Optimized for massive write throughput and time-series data.
  • Graph databases (Neo4j): Designed to map complex relationships, like social networks or recommendation engines.

NoSQL databases often favor horizontal scaling (adding more servers), over increasing the power of a single server that is vertical scaling. This makes them well-suited for large, distributed workloads where consistency doesn't need to be absolute. When deciding between the two, consider your application's needs. If it relies on complex transactions and consistent data structures, a Relational database is a good choice. On the other hand, if you're working with unstructured data, need to make frequent changes to your schema, or require a system that can scale horizontally to handle a large volume of data, a NoSQL database is probably the better option.

ACID Properties

When dealing with critical data, partial updates are unacceptable. It guarantee that database transactions are processed reliably. For example, if a bank transfers $50 from Account A to Account B, the system must complete the transaction. If it deducts $50 from A but crashes before adding it to B, the money will be lost.

Transaction is a single logical unit of work, consisting of one or more SQL operations.

ACID stands for: Atomicity, Consistency, Isolation, and Durability.

  • Atomicity: It ensures that either all operations in a transaction succeed or none do. This rule prevents the database from leaving data in a suspended state.
  • Consistency: It ensures, states in a database are always valid. If a rule prohibits negative account balances, a transaction that violates this rule will be aborted.
  • Isolation: It allows concurrent transactions to occur independently without interfering with each other. This is important in scenarios like two people trying to withdraw the last $50 from a joint account at the same time, where isolation ensures only one transaction succeeds.
  • Durability: After writing the data, a power failure or the server crash would not result in data loss, the data will be recovered when the system comes back online.

Relational databases, such as PostgreSQL and MySQL, follow ACID properties strictly. In contrast, many NoSQL databases compromise on some of these guarantees, specifically consistency and isolation, to achieve higher availability and performance, as described by the CAP theorem.

Indexing Strategies

As tables increase in size from thousands to millions of rows, query performance slows down. The database has to read every row to find a match if there's no index, which is similar to reading a 500-page book to find one specific word. A database index is a separate data structure, often a B-Tree, that helps the database find data quickly. It works much like the index at the back of a book, enabling fast lookup without having to scan every row.

Common indexing strategies:

  • Primary Key Index: Automatically created for the primary key. Rows are physically sorted by this key.
  • Secondary Index: Created on frequently searched columns, like email or status.
  • Composite Index: An index on multiple columns. If you frequently query WHERE last_name = 'Smith' AND first_name = 'John', a composite index on (last_name, first_name) is highly efficient.

They speed up read operations, such as SELECT, but slow down write operations, including INSERT, UPDATE, and DELETE, because the database has to update the index structure every time the data changes. Indexes also take up disk space. The goal of query optimization is to find the right balance, indexing the columns that are queried most often, while avoiding over-indexing tables with heavy write activity.

Connection Pooling

Establishing a connection to a database is expensive, involving network handshakes, authentication, and memory allocation on both the application and database servers. If an application opens a new connection for every HTTP request, it will quickly lead to a bottleneck and exhaust database resources. Connection pooling is a solution to this problem. The application creates a pool of ready-to-use connections at startup, rather than opening and closing connections as needed. When the application needs to communicate with the database, it borrows a connection from the pool, uses it, and then returns it. The pool manages the connections, ensuring they are healthy and reused safely.

Key connection pool settings to be aware of:

  • Minimum/Maximum Connections: Sets the baseline and limits to prevent the database from being overwhelmed.
  • Connection Timeout: How long the app will wait for a connection from the pool before giving up and throwing an error.
  • Idle Timeout: How long an unused connection can sit in the pool before being closed to save resources.

Basic Caching

Even with optimized queries and connection pooling, hitting a disk-based database for every request adds latency. Caching is the most effective way to boost performance for read-heavy workloads. A cache, such as Redis or Memcached, is a high-speed, in-memory data store that sits between your application and the database. It stores data in RAM, which makes retrieving data from a cache much faster than reading from a disk.

The most common caching pattern is Cache-Aside (Lazy Loading):

  1. Application receives a request for data (e.g., "Get user profile for ID 42").
  2. Application checks the cache. If the data is there (a cache hit), it returns it immediately. The database is never touched.
  3. If the data is not in the cache (a cache miss), the application queries the database.
  4. The database returns the data, and the application writes a copy of it into the cache with an expiration time (TTL - Time To Live).
  5. The application returns the data to the user.

Important "There are only two hard things in Computer Science: cache invalidation and naming things." - Phil Karlton

Caches are volatile, which means that if you update a user's email in the database, the old email may still be stored in the cache. This outdated information will remain there until the time to live (TTL) expires. To ensure consistency, you need to implement strategies such as deleting or updating the cache whenever the database is modified. This approach does add complexity to the system.

Representational Diagram

When a client requests data in a well-designed system, these database fundamentals work together seamlessly:

Dataflow in DB Index

The Application checks the Cache first to avoid hitting the disk. If the cache is empty, it borrows a connection from the Connection Pool, which saves time and resources. The Database executes the query efficiently using Indexes, so it doesn't have to scan the whole table. The ACID properties guarantee that data remains consistent, especially if the request is an update. The app then returns the data and updates the cache for future requests.

← Function Approximation as Supervised Learning Fundamentals of Natural Language Processing →