
Lets Code Them Up!

  • Generative AI for Everyone | Notes | Week 3

    January 27th, 2024

    Recently I finished Andrew Ng’s course on Coursera – Generative AI for Everyone. These are my notes for week 3 of that course.

    Most of this week focused on how generative AI is useful and why it is not going to take away jobs from people to the extent projected. Even if some jobs are impacted, AI will likely create more jobs than it displaces.

    Generative AI and Business – Most businesses contain many jobs/roles. Each role carries out a set of tasks. The goal of a business should be to identify which of those tasks could be automated using AI in a way that helps the person in that role.

    For example, a gardener

    1. trims the grass,
    2. pulls out the weeds,
    3. collects and throws away garden trash
    4. plants grass
    5. plants fruits
    6. adds soil
    7. and so on

    As we can see, it’s not possible to automate any of the tasks above using generative AI, so the role of gardener cannot be automated.

    Next, we see that generative AI can be used to augment an existing job instead of replacing it.

    For example, let’s think of a surgeon who has to perform a complicated surgery the next day. Right now, they have to go through various medical textbooks, research papers and so on to research the procedure, and then perform the surgery the next day. With the help of generative AI tools, the same research could be done in much less time, so the surgeon can perform the surgery sooner.

    Generative AI will most likely affect the jobs of knowledge workers.

    Concerns about AI – 

    1. Amplifying humanity’s worst impulses – Generative AI is trained on data from the internet, which contains biases. As a result, the AI will also reflect those biases in its answers. This bias can be reduced using RLHF (Reinforcement Learning from Human Feedback).
    2. Job loss – As we said earlier, AI automates/augments various tasks within a job; it can’t automate the whole job. So job loss won’t be that significant. For example, a radiologist has many tasks like
      • operating imaging software
      • communicating results
      • interpreting x-rays
      • handling complications during procedures and so on.
        Many of these tasks cannot be automated; only “interpreting x-rays” could be. So we will always need a radiologist. However, a radiologist who uses AI could replace one who doesn’t in the future.
    3. Causing human extinction – The idea that AI could cause human extinction looks far-fetched.

    Artificial General Intelligence

    AI that can do any intellectual task that a human can.

    Examples:

    • Learn to drive a car through ~20 hours of practice.
    • Complete a PhD thesis after ~5 years of work.
    • Do all the tasks of a computer programmer (or any other knowledge worker)

    Responsible AI

    • Fairness: Ensuring AI does not perpetuate or amplify biases
    • Transparency: Making AI systems and their decisions understandable to stakeholders impacted
    • Privacy: Protecting user data and ensuring confidentiality
    • Security: Safeguarding AI systems from malicious attacks
    • Ethical Use: Ensuring AI is used for beneficial purposes

    Thanks for stopping by! 

  • Generative AI for Everyone | Notes | Week 2

    January 20th, 2024

    Recently I finished Andrew Ng’s course on Coursera – Generative AI for Everyone. These are my notes for week 2 of that course.

    The second week focused on using generative AI to create software applications – how much it costs and how much time it takes.

    Let’s consider building a software application for restaurant reputation monitoring using machine learning.

    1. Collect a few hundred to a few thousand data points and label them. (approx. 1 month)
    2. Find the right AI model and train it on the data so it learns to output positive/negative depending on the input. (approx. 3 months)
    3. Find a cloud service to deploy and use the model. (approx. 3 months)

    This process generally takes months to complete. However, you can do the same using generative AI within weeks. Here are the steps involved –

    1. Scope the project 
    2. Build the initial prototype and then work on improving it.
    3. Evaluate the outputs to increase the system’s accuracy.
    4. Deploy using a cloud service and monitor it.
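
    To make step 2 concrete, here is a minimal sketch of a first prototype for the restaurant reputation monitor, built by prompting a general-purpose LLM instead of training a model. It assumes the OpenAI Python client and an API key; the model name, prompt wording and example reviews are placeholders you would iterate on.

      # A minimal prototype: classify restaurant reviews as positive or negative
      # by prompting a general-purpose LLM. Assumes the OpenAI Python client is
      # installed and OPENAI_API_KEY is set; the model name is a placeholder.
      from openai import OpenAI

      client = OpenAI()

      def classify_review(review: str) -> str:
          prompt = (
              "Classify the sentiment of the following restaurant review as "
              "exactly one word, 'positive' or 'negative'.\n\nReview: " + review
          )
          response = client.chat.completions.create(
              model="gpt-4o-mini",  # placeholder; any capable chat model works
              messages=[{"role": "user", "content": prompt}],
          )
          return response.choices[0].message.content.strip().lower()

      # Step 3 (evaluation) can start as simply as checking a few labeled examples.
      examples = [("The pasta was amazing and the staff were lovely.", "positive"),
                  ("Waited an hour and the food arrived cold.", "negative")]
      for review, expected in examples:
          print(expected, "->", classify_review(review))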

    Retrieval Augmented Generation – 

    As we know, an LLM takes a prompt as input, and the length of the prompt is limited; if we provide additional context to improve the answer, the length of that context is limited too.

    RAG is an additional technique that lets you give the LLM context from a large collection of documents without having to fit all of them into the limited prompt – only the relevant pieces are added. It works in three steps –

    1. Given a prompt, search for the documents relevant to generating the answer.
    2. Add that retrieved text to the prompt.
    3. Then feed the new prompt with the additional context to the LLM.
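
    Here is a minimal sketch of those three steps. TF-IDF similarity stands in for the retrieval step (real systems usually use embedding-based vector search), and the documents and question are made up for illustration.

      # Sketch of RAG: retrieve the most relevant document for a prompt, then
      # add it to the prompt as context. TF-IDF stands in for real vector search.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      documents = [  # hypothetical knowledge base, e.g. pages from a company site
          "Our support team is available Monday to Friday, 9am to 5pm.",
          "Refunds are processed within 7 business days of receiving the item.",
          "We ship to the US, Canada, and the EU.",
      ]
      question = "How long do refunds take?"

      # Step 1: search the documents for the text most relevant to the question.
      vectorizer = TfidfVectorizer()
      doc_vectors = vectorizer.fit_transform(documents)
      scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)[0]
      best_doc = documents[scores.argmax()]

      # Steps 2 and 3: add the retrieved text to the prompt; this augmented
      # prompt is what gets fed to the LLM.
      augmented_prompt = (
          f"Answer the question using only this context:\n{best_doc}\n\n"
          f"Question: {question}"
      )
      print(augmented_prompt)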

    Many applications use this technique, for example Coursera Coach, which uses information from the Coursera website to answer specific questions that students ask. Many companies are creating chatbots for their offerings and use RAG to provide the additional context.

    Fine-tuning – 

    Most generative AI models are trained on data from the web; they are general-purpose LLMs. We can take such an LLM and fine-tune it on our domain-specific data so that it learns to give the niche output we want.

    For example, a general-purpose LLM will not be able to give correct output on medical or legal records. We can first fine-tune our LLM on such records so that it starts giving correct output for new medical records.

    BloombergGPT is one such solution, built specifically for financial data.
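
    For illustration, here is a hedged sketch of fine-tuning a small open-source base model on a plain-text domain corpus using the Hugging Face Transformers and Datasets libraries. The model name ("gpt2"), the file name ("domain_corpus.txt") and the hyperparameters are placeholders, and real medical or legal data would of course need careful handling.

      # Sketch of fine-tuning a small general-purpose LLM on domain-specific text.
      from datasets import load_dataset
      from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                DataCollatorForLanguageModeling, Trainer,
                                TrainingArguments)

      model_name = "gpt2"  # placeholder for any small open-source base model
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
      model = AutoModelForCausalLM.from_pretrained(model_name)

      # One domain-specific training example per line of plain text.
      dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
      tokenized = dataset["train"].map(
          lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
          batched=True, remove_columns=["text"])

      trainer = Trainer(
          model=model,
          args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                                 per_device_train_batch_size=2),
          train_dataset=tokenized,
          data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
      )
      trainer.train()  # further trains the base model on the domain corpus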

    How to choose a model?

    1. Based on Model size –
      • 1B parameters – pattern matching and basic knowledge of the world
      • 10B parameters – greater world knowledge and can follow basic instructions
      • 100B parameters – rich world knowledge and complex reasoning. 
    2. Open source/closed source
      • Closed source models – available through a cloud programming interface (API); easy to use and not that costly
      • Open source models – full control over the model and data; can be run on your own systems.

    LLM, Instruction Tuning and RLHF – 

    Instruction tuning is training your LLM on a specific set of questions and answers, or examples of the LLM following instructions, so that it learns how to follow specific types of instructions.

    Reinforcement Learning from Human Feedback – to further improve the LLM, we can use supervised learning to rate the LLM’s answers.

    1. Train an answer quality model. Given a prompt, we get multiple answers from the LLM, and we store these answers together with their ratings (scored by humans) in a dataset. Then we train an ML model on this data to automatically rate answers.
    2. Have the LLM generate more responses for different prompts and train the LLM to generate answers with higher ratings.
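
    As a toy illustration of step 1, the sketch below fits a very simple “answer quality” model to human-rated answers. Real reward models are neural networks built on the LLM’s own representations; TF-IDF plus ridge regression is only a stand-in, and the rated examples are made up.

      # Toy reward model for step 1: learn to predict human ratings of answers.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import Ridge
      from sklearn.pipeline import make_pipeline

      # Hypothetical (prompt + answer, human rating) pairs.
      rated_answers = [
          ("What is RAG? It retrieves relevant documents and adds them to the prompt.", 0.9),
          ("What is RAG? It is a type of cloth.", 0.1),
          ("Summarize week 2. It covers building LLM apps, RAG, fine-tuning and RLHF.", 0.8),
          ("Summarize week 2. I don't know.", 0.2),
      ]
      texts, ratings = zip(*rated_answers)

      reward_model = make_pipeline(TfidfVectorizer(), Ridge())
      reward_model.fit(list(texts), list(ratings))

      # In step 2, scores like this one are used to push the LLM toward
      # answers that the reward model rates more highly.
      print(reward_model.predict(["What is RAG? Retrieval Augmented Generation."]))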

    Thanks for stopping by! 

     

  • Generative AI for Everyone | Notes | Week 1

    January 13th, 2024

    Recently I finished Andrew Ng’s course on Coursera – Generative AI for Everyone. These are my notes for week 1 of that course.

    Generative AI is defined as artificial intelligence systems that can generate high quality content like images, text, audio and video. Through ChatGPT, we have seen it can produce text. Adobe has AI in its tools that can create images from prompts.

    Andrew makes a point in one of the videos that AI technology is general purpose. Just like electricity is used to power many things, AI can be applied to various problems. We already see AI applications in our day-to-day lives, like spam prediction, recommendations on Amazon/Netflix, ChatGPT, etc.

    Some applications of generative AI –

    1. Writing – As LLMs work by predicting the next words and sentences, they can be used to write something for you. For example, you can ask an LLM to write your LinkedIn post or blog post for you. LLMs are also used for translating from one language to another.
    2. Reading – LLMs can also read long texts and draw concise conclusions from them. For example, you can ask an LLM to go through your resume and create a LinkedIn summary for you (see the sketch after this list).
    3. Chatting – LLMs can be used to build specialized chatbots for an organization as per its requirements.
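
    As a small illustration of the “Reading” use case from point 2 above, here is a hedged sketch that asks a chat model to turn resume text into a short LinkedIn summary using the OpenAI Python client; the model name, prompt and resume text are placeholders.

      # Sketch of the "Reading" use case: summarize resume text into a short
      # LinkedIn blurb. Assumes the OpenAI client and OPENAI_API_KEY are set up.
      from openai import OpenAI

      client = OpenAI()

      resume_text = ("Software engineer with 6 years of experience in Java and iOS. "
                     "Led a team of 4 building a payments SDK used by 2M monthly users.")

      response = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder
          messages=[{
              "role": "user",
              "content": "Write a two-sentence LinkedIn summary based on this "
                         "resume:\n" + resume_text,
          }],
      )
      print(response.choices[0].message.content)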

    Limitations of LLMs – 

    1. Knowledge cutoff – An LLM’s knowledge is confined to the data that was used to train it.
    2. Hallucinations – LLMs can make things up in a very confident and authoritative way.
    3. Input length (prompt/context length) and output length are limited.
    4. LLMs don’t work well with tabular (structured) data.
    5. LLMs can reflect the bias of the data they were trained on.

    Thanks for stopping by! 

  • DDIA | Partitioning | Rebalancing Partitions

    November 18th, 2023

    In previous posts, we learned about partitioning schemes for key-value pairs and secondary indexes. In all of these schemes there is a need to redistribute data when a new node or a new partition is added. In this post, we are going to touch upon the techniques for handling that data migration.

    The process of moving load from one node in the cluster to another is called rebalancing.

    Some requirements for rebalancing – 

    • After rebalancing, the load (data storage, read and write requests) should be shared fairly between the nodes in the cluster.

    • While rebalancing is happening, the database should continue accepting reads and writes.

    • No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.

    Strategies for Rebalancing

    1. Fixed number of partitions – create many more partitions than there are nodes, and assign several partitions to each node. Only entire partitions are moved between nodes. If a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed once again. The number of partitions does not change, nor does the assignment of keys to partitions. The only thing that changes is the assignment of partitions to nodes. This change of assignment is not immediate—it takes some time to transfer a large amount of data over the network—so the old assignment of partitions is used for any reads and writes that happen while the transfer is in progress. (See the sketch after this list.)

    2. Dynamic partitioning – Key range–partitioned databases such as HBase and RethinkDB create partitions dynamically. When a partition grows to exceed a configured size (on HBase, the default is 10 GB), it is split into two partitions so that approximately half of the data ends up on each side of the split. Conversely, if lots of data is deleted and a partition shrinks below some threshold, it can be merged with an adjacent partition. This process is similar to what happens at the top level of a B-tree. 

      Each partition is assigned to one node, and each node can handle multiple partitions, like in the case of a fixed number of partitions. After a large partition has been split, one of its two halves can be transferred to another node in order to balance the load. In the case of HBase, the transfer of partition files happens through HDFS, the underlying distributed filesystem.

      An advantage of dynamic partitioning is that the number of partitions adapts to the total data volume. If there is only a small amount of data, a small number of partitions is sufficient, so overheads are small; if there is a huge amount of data, the size of each individual partition is limited to a configurable maximum.

    3. Partitioning proportionally to nodes – make the number of partitions proportional to the number of nodes—in other words, to have a fixed number of partitions per node. In this case, the size of each partition grows proportionally to the dataset size while the number of nodes remains unchanged, but when you increase the number of nodes, the partitions become smaller again. Since a larger data volume generally requires a larger number of nodes to store, this approach also keeps the size of each partition fairly stable.
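
    Here is a minimal sketch of the first strategy (fixed number of partitions), as referenced above: a key always maps to the same partition via a hash, and rebalancing only changes which node owns each partition. The node names and numbers are made up for illustration.

      # Fixed number of partitions: hash(key) -> partition never changes; only
      # the partition -> node assignment changes when nodes are added or removed.
      import hashlib

      NUM_PARTITIONS = 1024  # chosen up front, much larger than the node count

      def partition_for_key(key: str) -> int:
          digest = hashlib.md5(key.encode()).hexdigest()
          return int(digest, 16) % NUM_PARTITIONS

      def assign_partitions(nodes: list[str]) -> dict[int, str]:
          # Spread the fixed set of partitions over the current nodes
          # (round-robin here; real systems move as few partitions as possible).
          return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

      assignment = assign_partitions(["node-a", "node-b", "node-c"])
      key = "user:42"
      p = partition_for_key(key)
      print(key, "-> partition", p, "-> node", assignment[p])

      # Adding a node changes the assignment of partitions to nodes,
      # but never the assignment of keys to partitions.
      assignment = assign_partitions(["node-a", "node-b", "node-c", "node-d"])
      print(key, "-> partition", p, "-> node", assignment[p])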

    Operations: Automatic or Manual Rebalancing

    Automatic rebalancing  – the system decides automatically when to move partitions from one node to another, without any administrator interaction. Fully automated rebalancing can be convenient, because there is less operational work to do for normal maintenance. However, it can be unpredictable. 

    Rebalancing is an expensive operation, because it requires rerouting requests and moving a large amount of data from one node to another. If it is not done carefully, this process can overload the network or the nodes and harm the performance of other requests while the rebalancing is in progress. So it’s better to have a human in the loop, i.e., use manual rebalancing.

    Manual rebalancing – the assignment of partitions to nodes is explicitly configured by an administrator, and only changes when the administrator explicitly reconfigures it.

    Thanks for stopping by! Hope this gives you a brief overview of rebalancing partitions. Eager to hear your thoughts and chat, please leave comments below and we can discuss.

  • DDIA | Partitioning | Partitioning and Secondary Indexes

    November 13th, 2023

    In the previous post, we learned about the partitioning schemes based on key-value pairs.

    If records are only ever accessed via their primary key, we can determine the partition from that key and use it to route read and write requests to the partition responsible for that key.

    The situation becomes more complicated if secondary indexes are involved as they are not unique. The problem with secondary indexes is that they don’t map neatly to partitions. There are two main approaches to partitioning a database with secondary indexes: document-based partitioning and term-based partitioning.

    Partitioning Secondary Indexes by Document

    Let’s take the example given in the book: we are running a website to buy/sell used cars. Each listing has a unique ID called the document ID, and we partition the database by document ID.

    If we want our users to search by make or color, then we will need to add those fields as secondary indexes on the database. The database takes care of indexing once we have declared the index, and it will add new entries to the index on the correct partition.

    Whenever you write to the database—to add, remove, or update a document—you only need to deal with the partition that contains the document ID that you are writing. For that reason, a document-partitioned index is also known as a local index.

    However, reading from a document-partitioned index requires care: if you want to search using a secondary index, you need to send the query to all partitions and combine all the results you get back, an approach known as scatter/gather. This can make read queries on secondary indexes quite expensive. Even if you query the partitions in parallel, scatter/gather is prone to tail latency amplification.

    Nevertheless, it is widely used: MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB all use document-partitioned secondary indexes.
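
    A toy sketch of scatter/gather on a document-partitioned (local) index: each partition indexes only its own car listings, so a query for a color must ask every partition and merge the results. The data and structures are made up for illustration.

      # Document-partitioned (local) secondary index: each partition indexes only
      # its own documents, so a secondary-index read must query every partition
      # (scatter) and combine the results (gather).
      from collections import defaultdict

      partitions = [  # each partition holds some car listings keyed by document ID
          {191: {"color": "red", "make": "Honda"}, 214: {"color": "black", "make": "Dodge"}},
          {306: {"color": "red", "make": "Ford"}, 768: {"color": "yellow", "make": "Audi"}},
      ]

      # Each partition maintains a local index over its own documents only.
      local_color_index = []
      for part in partitions:
          index = defaultdict(list)
          for doc_id, doc in part.items():
              index[doc["color"]].append(doc_id)
          local_color_index.append(index)

      def find_by_color(color: str) -> list[int]:
          results = []
          for index in local_color_index:  # scatter: ask every partition
              results.extend(index.get(color, []))
          return sorted(results)           # gather: combine the answers

      print(find_by_color("red"))  # -> [191, 306]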

    Partitioning Secondary Indexes by Term

    Rather than each partition having its own secondary index (a local index), we can construct a global index that covers data in all partitions.  However, we can’t just store that index on one node, since it would likely become a bottleneck and defeat the purpose of partitioning. A global index must also be partitioned, but it can be partitioned differently from the primary key index.

    For example, red cars from all partitions appear under color:red in the index, but the index is partitioned so that colors starting with the letters a to r appear in partition 0 and colors starting with s to z appear in partition 1. The index on the make of car is partitioned similarly (with the partition boundary being between f and h).

    As before, we can partition the index by the term itself, or using a hash of the term. Partitioning by the term itself can be useful for range scans (e.g., on a numeric property, such as the asking price of the car), whereas partitioning on a hash of the term gives a more even distribution of load.

    The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants.

    However, the downside of a global index is that writes are slower and more complicated, because a write to a single document may now affect multiple partitions of the index (every term in the document might be on a different partition, on a different node).
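
    A toy sketch of the term-partitioned (global) index described above: each term is routed to one index partition by its first letter, so a read touches a single partition, while a write to one document may update index entries on several partitions. The boundary and data are made up.

      # Term-partitioned (global) secondary index: the index itself is partitioned
      # by term, so reads go to one partition, but a single document write may
      # touch several index partitions.
      from collections import defaultdict

      # Two index partitions: terms a-r go to partition 0, s-z to partition 1.
      index_partitions = [defaultdict(list), defaultdict(list)]

      def partition_for_term(term: str) -> int:
          first_letter = term.split(":")[1][0]
          return 0 if first_letter <= "r" else 1

      def index_document(doc_id: int, doc: dict) -> None:
          # Indexing one document may update entries on different partitions.
          for field, value in doc.items():
              term = f"{field}:{value}"
              index_partitions[partition_for_term(term)][term].append(doc_id)

      index_document(306, {"color": "silver", "make": "ford"})
      # The single write above touched both index partitions:
      # "color:silver" lives on partition 1, "make:ford" on partition 0.

      def find(term: str) -> list[int]:
          return index_partitions[partition_for_term(term)][term]  # one partition

      print(find("color:silver"), find("make:ford"))  # -> [306] [306]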

    Thanks for stopping by! Hope this gives you a brief overview of partitioning and secondary indexes. Eager to hear your thoughts and chat, please leave comments below and we can discuss.

  • DDIA | Partitioning | Partitioning of Key-Value Data

    November 4th, 2023

    In the previous few posts, we learned about replication; from this post on, we will look into partitioning.

    Replication is having multiple copies of the same data on multiple nodes. For very large datasets, or very high query throughput, that is not sufficient: we need to break the data up into partitions, also known as sharding.

    The main reason for wanting to partition data is scalability. Different partitions can be placed on different nodes in a shared-nothing cluster.  Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors.

    Partitioning of Key-Value Data

    Skewed partitioning – If the partitioning is unfair, so that some partitions have more data or queries than others, we call it skewed.

    Hotspot – all the load could end up on one partition, so 9 out of 10 nodes are idle and your bottleneck is the single busy node. A partition with disproportionately high load is called a hot spot.

    1. Partitioning by Key Range – One way of partitioning is to assign a continuous range of keys (from some minimum to some maximum) to each partition. If you know the boundaries between the ranges, you can easily determine which partition contains a given key. If you also know which partition is assigned to which node, then you can make your request directly to the appropriate node.

      The downside of key range partitioning is that certain access patterns can lead to hot spots. If the key is a timestamp, then the partitions correspond to ranges of time, and it’s possible for some of the data to be accessed heavily at particular times while nothing gets accessed at other times.

    2. Partitioning by Hash of Key – Use a hash function to determine the partition for a key. Once you have a suitable hash function for keys, you can assign each partition a range of hashes (rather than a range of keys), and every key whose hash falls within a partition’s range will be stored in that partition.

      Unfortunately, however, by using the hash of the key for partitioning we lose a nice property of key-range partitioning: the ability to do efficient range queries. Keys that were once adjacent are now scattered across all the partitions, so their sort order is lost. This can be solved using a compound primary key consisting of several columns. Only the first part of that key is hashed to determine the partition, but the other columns are used as a concatenated index for sorting the data (used by Cassandra).
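
    Here is a small sketch of partitioning by hash of key, as described in point 2: each partition owns a contiguous range of hash values, and a key is stored in whichever partition’s range its hash falls into. The hash function and boundaries are illustrative.

      # Partitioning by hash of key: each partition owns a range of *hash values*
      # (not key values), so keys spread evenly but key-range queries are lost.
      import bisect
      import hashlib

      def key_hash(key: str) -> int:
          # 32-bit hash of the key (truncated MD5; any stable hash works).
          return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

      # Upper (exclusive) hash boundary of each of four equal partitions.
      boundaries = [2**32 // 4 * i for i in range(1, 5)]

      def partition_for_key(key: str) -> int:
          return bisect.bisect_right(boundaries, key_hash(key))

      for key in ["2023-11-04", "2023-11-05", "2023-11-06"]:
          print(key, "-> hash", key_hash(key), "-> partition", partition_for_key(key))
      # Adjacent keys (e.g. consecutive dates) land on unrelated partitions,
      # which spreads load evenly but makes key-range scans expensive.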

     

    Skewed Workloads and Relieving Hot Spots

    Consider this example: on a social media site, a celebrity user with millions of followers may cause a storm of activity when they do something. This event can result in a large volume of writes to the same key (where the key is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on). Hashing the key doesn’t help, as the hash of two identical IDs is still the same.

    Today, most data systems are not able to automatically compensate for such a highly skewed workload, so it’s the responsibility of the application to reduce the skew.

    For example, if one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key. However, having split the writes across different keys, any reads now have to do additional work, as they have to read the data from all keys and combine it.

    This technique also requires additional bookkeeping: it only makes sense to append the random number for the small number of hot keys; for the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you also need some way of keeping track of which keys are being split.
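
    A toy sketch of the technique just described: appending a small random number to a known hot key spreads its writes over several sub-keys, at the cost of having to read and combine all of them later. The key names and split factor are made up.

      # Relieving a hot spot by splitting a hot key into several sub-keys.
      # Writes pick a random sub-key; reads fetch all sub-keys and combine them.
      import random
      from collections import defaultdict

      HOT_KEYS = {"celebrity:12345"}  # keys known (via bookkeeping) to be very hot
      SPLIT_FACTOR = 10

      store = defaultdict(list)       # stand-in for the partitioned key-value store

      def write(key: str, value: str) -> None:
          if key in HOT_KEYS:
              key = f"{key}:{random.randrange(SPLIT_FACTOR)}"  # spread the writes
          store[key].append(value)

      def read(key: str) -> list[str]:
          if key in HOT_KEYS:
              # Extra read-side work: gather every sub-key and combine the results.
              values = []
              for i in range(SPLIT_FACTOR):
                  values.extend(store[f"{key}:{i}"])
              return values
          return store[key]

      for i in range(1000):
          write("celebrity:12345", f"comment {i}")
      print(len(read("celebrity:12345")))  # -> 1000, spread over 10 sub-keys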

    Thanks for stopping by! Hope this gives you a brief overview of partitioning. Eager to hear your thoughts and chat, please leave comments below and we can discuss.
