System Design Notes — Part 1

Concerns that are important in most software design

When designing a large-scale system, we need to define its architecture, components, modules, interfaces, and data so that it satisfies the specified requirements. Here are some of the key considerations to keep in mind.

Three important aspects of most system designs

Scalability

Scalability is the ability of a system to handle increasing load and traffic as it grows in size and complexity. A system may need to scale for various reasons, such as a surge in data volume or a growing workload, for example a higher number of transactions. Ideally, a scalable system can grow without compromising its performance.

Horizontal scaling and vertical scaling are two different approaches to increasing the capacity and performance of a system.

Horizontal Scaling

Horizontal scaling, also known as scaling out, involves adding more machines or nodes to a system and distributing the load across them. In this approach, each node operates independently and handles a portion of the overall workload. One benefit is that the system no longer depends on a single machine, so the failure of one node need not take down the whole system. It can also be more cost-effective than vertical scaling, because capacity is added with commodity hardware. However, the system must be designed for distributed computing, which may increase complexity. Examples: Cassandra, MongoDB.
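To make the idea of distributing load across nodes concrete, here is a minimal sketch of a round-robin dispatcher. The class name, the node names, and the round-robin policy are all assumptions made for illustration; real load balancers use a variety of strategies.

```python
from collections import Counter


class RoundRobinBalancer:
    """Hypothetical dispatcher that spreads requests evenly across nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # counter that selects the node for the next request

    def add_node(self, node):
        # Scaling out: capacity grows by adding another machine to the pool.
        self.nodes.append(node)

    def route(self, request_id):
        # request_id is unused in this sketch; a real balancer might hash it.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        return node


def load_per_node(balancer, num_requests):
    """Route num_requests requests and count how many each node receives."""
    return Counter(balancer.route(i) for i in range(num_requests))


balancer = RoundRobinBalancer(["node-a", "node-b"])
print(load_per_node(balancer, 100))  # two nodes share the requests

balancer.add_node("node-c")          # scale out by one node
print(load_per_node(balancer, 99))   # three nodes now share the requests
```

Adding a node lowers the share of requests each machine has to serve, which is the core of scaling out. The sketch ignores node failures and uneven request costs.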

Disadvantages

Complexity: Horizontal scaling can be more complex than vertical scaling, especially when dealing with a large number of nodes. This can increase the management and operational costs associated with the system.
Network latency: In a distributed system, communication between nodes can introduce network latency, which can impact system performance. This can be particularly problematic when nodes are geographically dispersed.
Data consistency: Maintaining data consistency can be more challenging in a horizontally scaled system, because each node may have its own copy of the data and all copies must be kept consistent and up to date.
Increased communication overhead: As the number of nodes in a system increases, so does the communication overhead between them. This can lead to slower system performance and higher bandwidth usage.
Additional infrastructure: Horizontal scaling often requires additional infrastructure, such as load balancers, to manage traffic between nodes. This can add to the cost and complexity of the system.
Limited scalability in certain cases: Some applications may not be able to scale horizontally, such as those that rely on shared resources or have a high level of inter-node communication.

Vertical Scaling

Vertical scaling, also known as scaling up, involves adding more resources, such as memory, CPU, or storage, to a single machine. The system handles more load because that machine has more resources available. The benefit of vertical scaling is that it can be simpler to implement, as it does not require complex distributed computing infrastructure. However, the single machine is a single point of failure, and the approach can be more expensive, as it often involves high-end hardware. Example: MySQL.

Disadvantages

Limited scalability: There is a limit to how many resources can be added to a single machine. Eventually, adding more becomes too costly or physically infeasible, which caps the scalability of the system.
Single point of failure: Since all resources are added to a single machine, if that machine fails, the entire system may go down. This can result in more downtime and a higher risk of data loss.
Expensive: Vertical scaling often requires using high-end hardware, which can be significantly more expensive than commodity hardware used in horizontal scaling. This can make it harder to justify the costs of scaling up.
Resource wastage: Resources added to a single machine are wasted if the workload cannot use them. For example, if CPU utilization is already low, adding more CPU will not provide any performance benefit.
Disruption during upgrades: Upgrading a vertically scaled system can be disruptive, as it often requires downtime to add or replace hardware components. This can result in lost revenue and inconvenience to users.

Reliability

The system should be dependable and available to users at all times. It should be designed to handle failures and minimize downtime.

A distributed system is deemed dependable if it continues to provide its services even when one or more of its software or hardware components fail. Such a system achieves reliability through redundancy of both its software components and its data.
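The effect of redundancy can be estimated with a short calculation. The sketch below assumes that replicas fail independently and that the service stays up as long as at least one replica is up. Both are simplifying assumptions, and the function name is hypothetical.

```python
def combined_availability(replica_availability, replicas):
    """Probability that at least one of `replicas` independent copies is up."""
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - replica_availability) ** replicas


for n in (1, 2, 3):
    print(n, combined_availability(0.99, n))
```

Under these assumptions, each extra replica multiplies the chance of a total outage by the failure probability of a single replica. Correlated failures, such as a shared power supply or a bad deployment, weaken the benefit in practice.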

For example, a banking system must be reliable to ensure that transactions are processed correctly and on time, and that customer accounts are accurate. If a customer requests to withdraw money from their account, the banking system must reliably process the transaction, ensuring the correct amount is withdrawn and the account balance is updated accurately. If the system is not reliable, there could be errors in the transaction or even a loss of funds, which would result in a loss of customer trust and potentially significant financial consequences for the bank.

Reliability is crucial for any system that performs a critical function, and it helps to build trust and confidence among the system’s users.

Maintainability

The system should be easy to maintain and update over time, with minimal impact on users and the overall system.

Serviceability, or manageability, is the simplicity and speed with which a system can be repaired or maintained. If the time to fix a failed system increases, availability decreases. Early detection of faults can reduce or even avoid system downtime.
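The link between repair time and availability can be made concrete with the common steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The formula and the numbers below are not from the original notes; they are an illustrative sketch.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up, given mean uptime and repair time."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate, slower repairs: availability drops.
print(availability(999, 1))
print(availability(999, 10))
```

With the failure rate held constant, every extra hour of repair time directly lowers availability, which is why fast fault detection and repair matter.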

For example, consider an industrial machine used in a manufacturing plant. If the machine is not maintainable, it may require extensive downtime to repair or upgrade, resulting in lost productivity and revenue for the plant. On the other hand, a maintainable machine can be quickly serviced, and its parts can be easily replaced, ensuring that it remains operational and efficient over its lifespan.

Maintainability is a critical factor in ensuring the longevity and efficiency of a system, and it is essential to minimize downtime, reduce costs, and improve the overall user experience.

While Scalability, Reliability, and Maintainability are crucial aspects of most system designs, there are other concerns that must also be taken into account.

Performance

The system should be designed to meet performance requirements and provide fast response times to users.

Two standard measures of a system’s performance are response time (or latency), which denotes the delay before the first item is obtained, and throughput (or bandwidth), which denotes the number of items delivered per unit of time.
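Both measures can be computed from the times at which items arrive. The sketch below assumes a list of delivery times, in seconds after the request was sent and in ascending order. The function name and the sample values are made up for illustration.

```python
def latency_and_throughput(delivery_times):
    """Return (latency in seconds, throughput in items per second).

    delivery_times: seconds after the request at which each item arrived,
    sorted in ascending order.
    """
    if not delivery_times:
        raise ValueError("need at least one delivered item")
    total_time = delivery_times[-1]
    if total_time <= 0:
        raise ValueError("last delivery time must be positive")
    latency = delivery_times[0]                     # delay until the first item
    throughput = len(delivery_times) / total_time   # items per second overall
    return latency, throughput


latency, throughput = latency_and_throughput([0.2, 0.3, 0.4, 0.5])
print(f"latency={latency}s throughput={throughput} items/s")
```

The two measures are independent: a system can start responding quickly yet deliver items slowly, or the reverse, so performance requirements usually state both.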

Security

The system should be designed to protect against unauthorized access, data breaches, and other security threats.

Usability

The system should be easy to use and navigate for users, with clear instructions and minimal complexity.

Interoperability

The system should be designed to integrate with other systems and platforms, as needed.

Cost

The system should be designed with cost in mind, with a focus on optimizing resources and minimizing expenses.

Legal and regulatory compliance

The system should comply with all relevant legal and regulatory requirements, including data privacy laws and industry-specific regulations.

Ethical considerations

The system should be designed with ethical considerations in mind, including fairness, transparency, and respect for user privacy and autonomy.