In the last post, we discussed the reliability of data systems. The next topic discussed in the book is scalability.
Scalability is the ability of a system to maintain its performance as the incoming load changes.
What is load? Load describes how much demand is placed on a system. Generally we use certain numbers to describe load, called load parameters, and which parameters matter depends on the architecture of the system. For example, for a service it is the number of incoming requests, for a database it is the number of reads and writes, etc.
Performance parameters/metrics – These metrics measure the performance of a system and let us determine what happens in both of these cases –
- When we increase a load parameter but keep the system resources the same, how does the performance of the system change?
- When we increase a load parameter, how much do we need to increase the resources to keep the performance of the system the same?
Some common performance metrics –
- Throughput – the number of records the system can process per second, or the total time it takes to run a job on a dataset of a certain size.
- Service Response Time – the time between a client sending a request and receiving a response. This is the time the client actually sees, so it includes network delays and other delays.
Generally, service response time is measured in percentiles. The response times observed over a period are sorted from fastest to slowest. The median, or p50, is the halfway point. p50 is a good metric if you want to know how long users typically have to wait for a response.
For finding outliers, p90, p95, p99, and p999 are good metrics. p95 = x means 95% of requests take less than x seconds and the remaining 5% take more than x seconds.
- Latency – Latency is the duration that a request is waiting to be handled – during which it is latent, awaiting service.
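To make percentiles concrete, here is a minimal sketch in Python of computing p50/p95/p99 from a window of measured response times. The sample values and the simple nearest-rank approach are just for illustration, not how any particular monitoring tool does it.

```python
# Sketch: nearest-rank percentiles over a window of response-time measurements.
# The sample values below are made up purely for illustration.
def percentile(sorted_times, p):
    """Return the value below which roughly p% of the sorted measurements fall."""
    index = int(round((p / 100) * (len(sorted_times) - 1)))
    return sorted_times[index]

response_times_ms = sorted([12, 15, 18, 22, 25, 30, 35, 40, 120, 900])

print("p50:", percentile(response_times_ms, 50), "ms")  # the median: typical wait
print("p95:", percentile(response_times_ms, 95), "ms")  # the tail: slow outliers
print("p99:", percentile(response_times_ms, 99), "ms")  # the extreme tail
```

Notice how a single very slow request (900 ms here) barely moves the median but dominates the high percentiles, which is exactly why the tail percentiles are used to find outliers.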
Some other interesting terms related to performance metrics –
- head-of-line blocking – A server can only process a small number of requests in parallel, so even a few slow requests can hold up the processing of the requests behind them. This is called head-of-line blocking.
- tail latency amplification – Say our backend service calls multiple services in parallel to serve an end-user request. The end-user request is slowed down if even one of those parallel calls is slow. This is called tail latency amplification; a small simulation of the effect is sketched below.
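Here is a hedged simulation sketch of tail latency amplification in Python. The latency distribution (fast 99% of the time, slow 1% of the time) and the fan-out values are invented for illustration; the point is that the user-facing request must wait for the slowest of its parallel backend calls, so the chance of hitting at least one slow call grows quickly with fan-out.

```python
import random

# Assume each backend call is fast (10 ms) 99% of the time and slow (1000 ms)
# 1% of the time. These numbers are made up for illustration.
def backend_call_latency():
    return 1000 if random.random() < 0.01 else 10

def user_request_latency(fan_out):
    # The user-facing request finishes only when the slowest parallel call does.
    return max(backend_call_latency() for _ in range(fan_out))

for fan_out in (1, 10, 100):
    trials = 10_000
    slow = sum(1 for _ in range(trials) if user_request_latency(fan_out) >= 1000)
    print(f"fan-out {fan_out:>3}: {slow / trials:.1%} of user requests are slow")
```

With a fan-out of 1, roughly 1% of user requests are slow; with a fan-out of 100, well over half of them are, even though each individual backend call is still slow only 1% of the time.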
Approaches for handling load – scale up (switching to a more powerful machine) and scale out (distributing the load across more machines in parallel).
Some systems are elastic: they add more machines automatically as the load increases and remove them once the load decreases. Other systems are scaled manually, meaning a human analyses the system and decides when to add or remove resources.
The decision on which approach to use to handle the load and stay scalable depends on the architecture, and it might even change as the load grows in different ways.
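As a rough sketch of what an elastic system's scaling decision could look like, here is a tiny Python function that picks a number of instances so that the load per instance stays at or below a target. The function name, the target of 500 requests per second per instance, and the idea of driving scaling purely from request rate are all hypothetical simplifications, not a real autoscaler's policy.

```python
# Hypothetical elastic-scaling sketch: choose enough instances so that each one
# handles at most `target_per_instance` requests per second. All numbers are
# invented for illustration.
def desired_instances(requests_per_sec, target_per_instance=500):
    needed = -(-requests_per_sec // target_per_instance)  # ceiling division
    return max(1, needed)

print(desired_instances(1200))  # -> 3 instances
print(desired_instances(2600))  # -> 6 instances
```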
Thanks for stopping by! Hope this gives you a brief overview of the scalability aspect of data systems. Eager to hear your thoughts and chat, so please leave comments below and we can discuss.