diff --git "a/mini-llama-articles.csv" "b/mini-llama-articles.csv" deleted file mode 100644--- "a/mini-llama-articles.csv" +++ /dev/null @@ -1,15 +0,0 @@ -title,content,url,source -Beyond GPT-4: What's New?,"LLM Variants and Meta's Open Source Before shedding light on four major trends, I'd share the latest Meta's Llama 2 and Code Llama. Meta's Llama 2 represents a sophisticated evolution in LLMs. This suite spans models pretrained and fine-tuned across a parameter spectrum of 7 billion to 70 billion. A specialized derivative, Llama 2-Chat, has been engineered explicitly for dialogue-centric applications. Benchmarking revealed Llama 2's superior performance over most extant open-source chat models. Human-centric evaluations, focusing on safety and utility metrics, positioned Llama 2-Chat as a potential contender against proprietary, closed-source counterparts. The development trajectory of Llama 2 emphasized rigorous fine-tuning methodologies. Meta's transparent delineation of these processes aims to catalyze community-driven advancements in LLMs, underscoring a commitment to collaborative and responsible AI development. Code Llama is built on top of Llama 2 and is available in three models: Code Llama, the foundational code model;Codel Llama - Python specialized for Python;and Code Llama - Instruct, which is fine-tuned for understanding natural language instructions. Based on its benchmark testing, Code Llama outperformed state-of-the-art publicly available LLMs (except GPT-4) on code tasks. Llama 2, Llama 2-Chat, and Code Llama are key steps in LLM development but still have a way to go compared to GPT-4. Meta's open access and commitment to improving these models promise transparent and faster LLM progress in the future. Please refer to the LLM and Llama variants below: From LLMs to Multimodal LLMs, like OpenAI's ChatGPT (GPT-3.5), primarily focus on understanding and generating human language. They've been instrumental in tasks like text generation, translation, and even creative writing. However, their scope is limited to text. Enter multimodal models like GPT-4. These are a new breed of AI models that can understand and generate not just text, but also images, sounds, and potentially other types of data. The term ""multimodal"" refers to their ability to process multiple modes or types of data simultaneously. This is a game-changer. Imagine an AI that can not only read a description of a dress but also visualize it or even design it! Multimodal AI models are moving us towards more holistic AI systems. These systems can potentially understand our world in a more comprehensive manner, bridging the gap between different forms of data and providing richer, more integrated solutions. As we stand on the cusp of this new era, it's exciting to envision the myriad of applications and innovations that Multimodal models will bring to the table. The future of AI looks more integrated and versatile than ever before. From Connections to Vector DB The AI landscape is witnessing a fascinating transition: from Language Model (LLM) connections or integrations, e.g., LangChain and LlamaIndex, to the rise of Vector Databases (Vector DB) such as Weaviate, Milvus, Pinecone, Chroma, and Vespa.ai. But what's driving this shift, and why does it matter? LLM connections, like the LlamaIndex, primarily focus on linking and understanding vast amounts of external data. They've been pivotal in creating semantic connections, enabling more intuitive search experiences, and enhancing data accessibility. 
However, as the volume and variety of data grow, the need for more advanced storage and retrieval mechanisms becomes evident. This is where Vector DBs come into play. Unlike traditional databases that store data in rows and columns, Vector DBs store data in high-dimensional space, allowing for more efficient and accurate similarity searches. Tools like Weaviate and Milvus are designed to handle massive datasets, making them ideal for tasks like image recognition, recommendation systems, and more. The rise of Vector DBs represents a broader trend in AI: the quest for more efficient, scalable, and versatile data handling solutions. As we navigate this evolution, it's clear that the combination of LLMs and Vector DBs will redefine how we store, access, and understand data in the AI-driven future. From Agents to OS The AI realm is abuzz with innovations, and one of the most intriguing shifts we're witnessing is the transition from LLM agents to using LLMs as Operating Systems (OS). Let's delve into this evolution and its implications. LLM agents, like AutoGPT, AgentGPT, BabyAGI, and HuggingGPT, have been groundbreaking in automating tasks based on user requests. These agents leverage the power of Language Models (LLMs) to understand and execute commands, making them invaluable in tasks ranging from content generation to data analysis. Their adaptability and intelligence have made them a staple in many AI toolkits. However, the vision for AI doesn't stop there. The concept of LLM as an OS is emerging as the next big thing. Imagine an operating system where the core is a language model, orchestrating everything around it. Such a system would not just execute tasks but would understand context, anticipate needs, and offer solutions in real time. It's like turning the LLM into the brain of the digital ecosystem, making devices and applications more intuitive and responsive than ever. The move towards LLM as OS signifies a paradigm shift in how we perceive and utilize AI. It's not just about automation anymore; it's about creating a seamless, intelligent interface between humans and technology. As we stand on the brink of this transformation, the potential for LLM-driven OS to revolutionize our digital interactions is immense. From Fine-tuning to Plugins The world of LLMs is undergoing a transformative shift, moving from intricate fine-tuning processes to the more dynamic realm of plugins. Let's unpack this evolution. Historically, fine-tuning has been the cornerstone of LLM optimization. There are two primary ways to fine-tune LLMs: feeding data into the LLM in real-time and directly fine-tuning on the LLM. From a technical standpoint, this involves three methods: Transfer Learning: Adapting a pre-trained model to new tasks.Sequential Fine-tuning: Refining models in stages for specific tasks.Task-specific Fine-tuning: Tailoring models for a particular function. Moreover, LLM techniques like In-context learning, Few-shot learning, and Zero-shot learning have further enhanced the model's adaptability, allowing them to understand and generate content with minimal data. However, the future of LLMs is leaning towards plugins. With the introduction of tools like GPT-4 Plugins, the focus is on extending LLMs seamlessly. Instead of running LLMs as a service, they're envisioned as platforms. This means integrating LLMs with various tools, enhancing their capabilities, and offering a more modular and scalable approach to AI applications. 
The journey from fine-tuning to plugins represents a move from static optimization to dynamic adaptability, ensuring that LLMs remain at the forefront of AI innovation. In a Nutshell The AI domain is witnessing rapid shifts, with LLMs playing a central role. Initially, the move was from LLMs to Multimodal models, expanding from text to include images and sounds. Simultaneously, the trend shifted from LLM connections, which linked external data, to Vector Databases for efficient high-dimensional storage. Another evolution saw LLM agents, which automated tasks, transitioning towards LLMs as Operating Systems. This change aims for more intuitive, context-aware devices and applications. Furthermore, the traditional fine-tuning processes of LLMs are now being replaced by dynamic plugins, turning LLMs into platforms integrated with various tools. Leading this LLM revolution are OpenAI's GPT-4 and Meta's LLaMA2. Their pioneering efforts are setting the stage for an AI future that's more integrated, responsive, and attuned to human interactions. More Readings Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond: https://arxiv.org/abs/2304.13712Sparks of Artificial General Intelligence: Early experiments with GPT-4: https://arxiv.org/abs/2303.12712GPT4All-J: https://huggingface.co/nomic-ai/gpt4all-jIntroducing Code Llama, a state-of-the-art large language model for coding: https://ai.meta.com/blog/code-llama-large-language-model-coding/Llama 2: Open Foundation and Fine-Tuned Chat Models: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/",https://pub.towardsai.net/beyond-gpt-4-whats-new-cbd61a448eb9#dda8,towards_ai -Building a Q&A Bot over Private Documents with OpenAI and LangChain,"Private data to be used The example provided can be used with any dataset. I am using a data set that has Analyst recommendations from various stocks. For the purpose of demonstration, I have gathered publicly available analyst recommendations to showcase its capabilities. You can replace this with your own information to try this. Below is a partial extract of the information commonly found in these documents. If you wish to try it yourself, you can download analyst recommendations for your preferred stocks from online sources or access them through subscription platforms like Barron's. Although the example provided focuses on analyst recommendations, the underlying structure can be utilized to query various other types of documents in any industry as well. I have assembled such data for a few stocks for demonstration purposes. This includes Google, Microsoft, Meta, and Tesla. To facilitate easy access and updating of analysts' recommendations, all the recommendations can be organized into a designated folder. Each stock corresponds to a separate file within this folder. For example, if there are recommendations for 20 stocks, there will be 20 individual files. This organization enables convenient updating of information for each stock as new recommendations arrive, streamlining the process of managing and maintaining the most up-to-date data for each stock. Questions this Q&A bot application can answer The data we have for this application is stock market analyst recommendations for many stocks. Let's say you are looking for insight about Microsoft stock. 
You can ask any of the following questions as an example: What is the median target price for Microsoft (MSFT)?What is the highest price estimate for Microsoft (MSFT)?What is the lowest price estimate for Microsoft (MSFT)?How much percentage increase is expected in the stock price of Microsoft (MSFT)?How many analysts provided price forecasts for Microsoft (MSFT)?What is the current consensus among investment analysts regarding Microsoft (MSFT)?Has the consensus rating for Microsoft (MSFT) changed recently?When was the consensus rating last updated for Microsoft (MSFT)?Is the current recommendation for Microsoft (MSFT) to buy, sell, or hold the stock?Are there any recent analyst reports available for Microsoft (MSFT)? These questions cover various aspects of the stock analysis, including price forecasts, analyst recommendations, and recent changes in ratings. The chat system can provide specific answers based on the information available in the financial documents. Please note that you can not only ask questions about an individual stock but can also ask comparative questions across stocks. For example, which stock has the most price increase? Here the system will compare the price increase across all the stocks and provide an answer. Quick summary of how the web application works This web-based application allows users to input their questions in a text box and receive answers based on insights gathered from multiple documents. For instance, users can inquire, ""What is the highest price estimate for Microsoft?"" and the application will query the relevant documents to provide an accurate response. Moreover, users can also compare stocks by asking questions such as, ""Which stock, Meta or Microsoft, has a higher percentage increase in the stock price?"" The application will analyze the data across the documents, enabling users to make informed investment decisions based on the comparative insights provided. Application Overview The application is built with LangChain and ChatGPT. Though it uses ChatGPT, we can also wire this to other LLMs as well. LangChain is an innovative framework designed to empower you in building sophisticated applications driven by large language models (LLMs). By offering a standardized interface, LangChain facilitates the seamless integration of various components, including LLMs, data sources, and actions. This streamlined approach accelerates the development of robust applications, enhanced by features such as chaining, data awareness, and agentic capabilities. To complement LangChain, the web application is built utilizing Streamlit, a Python library for creating interactive web applications and data dashboards. Streamlit's open-source nature and user-friendly features simplify the process of developing web apps with minimal effort. This has made it a popular choice among developers, data scientists, and machine learning engineers seeking to build engaging and accessible applications. Initial setup Install OpenAI, LangChain, and StreamLit Import the relevant packages Set the API keys Define the LLM to use Ingesting private documents We used Langchain to ingest data. LangChain offers a wide range of data ingestion methods, providing users with various options to load their data efficiently. It supports multiple formats, including text, images, PDFs, Word documents, and even data from URLs. 
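To make the setup steps listed above concrete before continuing with ingestion, here is a rough sketch of the installation, imports, API key, and LLM definition. The package list and the ChatOpenAI class follow the classic LangChain API (circa 2023) and are assumptions rather than the exact code used in the article:

```python
# Rough sketch of the initial setup: install the packages, import them,
# set the API key, and define the LLM. Class names follow the classic
# LangChain API (circa 2023); newer versions may differ.
#   pip install openai langchain chromadb streamlit

import os
import streamlit as st  # used later for the web UI
from langchain.chat_models import ChatOpenAI

# Set the API key (placeholder value; in practice read it from a secret store)
os.environ["OPENAI_API_KEY"] = "sk-..."

# Define the LLM to use for answering questions
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
```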
In the current example, text files were utilized, but if you wish to work with a different format, you simply need to refer to the corresponding loader specifically tailored for that format. All the analysts' recommendations documents are stored in a dedicated folder. You have the flexibility to either refer to individual documents or retrieve all the documents within a specific folder. If you want to specify exact documents, you can do it the following way. To load the files you want to ingest, you can specify the path to each file individually. The loaded files can then be saved into a list. This list serves as the input that is sent to the vector database to store the data. The alternative approach is a more versatile method in which we can load all pertinent documents from a designated folder and store the file locations in a list for subsequent processing. This approach offers flexibility and allows for the efficient handling of multiple documents by capturing their locations in a centralized list, enabling seamless data retrieval and analysis. Load the documents into the vector store. When dealing with a vast number of documents, it becomes inefficient to send all documents (analyst recommendations) to your large language model (LLM) when seeking answers to specific questions. For instance, if your question pertains to MSFT, it would be more cost-effective to only send document extracts that reference MSFT to your LLM for answering the question. This approach helps optimize resource utilization. To achieve this, all documents are split into chunks and stored in a vector database in a numeric format (embeddings). When a new question is posed, the system queries the vector database for relevant text chunks related to this question, which is then shared with the LLM to generate an appropriate response. Within the LangChain framework, the VectorstoreIndexCreator class serves as a utility for creating a vector store index. This index stores vector representations of the documents (in chromadb), enabling various text operations, such as finding similar documents based on a specific question. When a user asks a question, a similarity search is performed in the vector store to get document chunks relevant to the question. The question, along with the chunks are sent to OpenAI to get the response back. Now we are ready to query these documents. Setting up the web application The application is presented in the browser using Streamlit, providing a user-friendly interface. Within the application, a text box is available for users to enter their questions. Upon submitting the question by pressing enter, the application processes the input and generates a corresponding response. This response is then displayed below the text box, allowing users to conveniently view the relevant information. Create a prompt based on the question asked by the user and display the response back to the user By calling index.query() with the specified parameters, you initiate the process of querying the vector database using the provided question. Vector database provides relevant text chunks that are relevant to the question asked. These text chunks, along with the original question, is passed to LLM. The LLM is invoked to analyze the question and generate a response based on the available data sent. The specific chaining process associated with the query is determined by the chain_type parameter, which is to use all the data (filtered by the question) sent to LLM. 
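Putting the ingestion and querying steps described above together, a minimal sketch could look like the following. The folder name, glob pattern, and Streamlit wiring are assumptions, and the snippet relies on the classic LangChain VectorstoreIndexCreator and index.query API (circa 2023), not necessarily the article's exact code:

```python
# Sketch of the ingestion-and-query flow: load the analyst files, build a
# Chroma-backed vector index, and answer questions through a Streamlit text box.
import streamlit as st
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.indexes import VectorstoreIndexCreator

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Load every analyst-recommendation text file from a dedicated folder
# (folder name and glob are placeholders).
loader = DirectoryLoader("analyst_recommendations/", glob="*.txt", loader_cls=TextLoader)

# Split the documents into chunks, embed them, and store them in the vector store.
index = VectorstoreIndexCreator().from_loaders([loader])

# Simple Streamlit UI: a text box for the question, the answer shown below it.
question = st.text_input("Ask a question about the analyst recommendations:")
if question:
    # A similarity search in the vector store retrieves relevant chunks,
    # which are then sent together with the question to the LLM.
    answer = index.query(question, llm=llm, chain_type="stuff")
    st.write(answer)
```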
Now the entire application is ready, and let's take it for a spin next. Ask few questions Let's try few questions The range of questions encompasses diverse facets of stock analysis, encompassing price forecasts, analyst recommendations, and recent rating changes. The chat system excels in delivering precise answers by leveraging the information contained within the financial documents. The system extends beyond individual stock inquiries and accommodates comparative queries across multiple stocks. For instance, one can ask about the stock with the highest price increase, prompting the system to compare price increases across all stocks and provide a comprehensive response. This versatility allows users to gain insights and make informed decisions across a broader spectrum of stock analysis. Conclusion The development of a Q&A bot over private documents using OpenAI and LangChain represents a remarkable achievement in unlocking the invaluable knowledge hidden within private document repositories. This web-based Q&A bot has the potential to empower users from various industries, enabling efficient access and analysis of critical information and ultimately enhancing productivity and decision-making capabilities. While we showcased a finance example to illustrate the concept, the bot's functionality extends to any domain. Simply by providing a folder with the relevant privacy documents, users can engage in natural language conversations with the bot. Once the data is ingested into a vector database, users can seamlessly query and retrieve information, propelling the capabilities of intelligent document analysis to new heights.",https://pub.towardsai.net/building-a-q-a-bot-over-private-documents-with-openai-and-langchain-be975559c1e8#bead,towards_ai -Enhancing E-commerce Product Search Using LLMs,"Problem Statement Despite the pioneers like Amazon [2], many E-commerce platforms are still heavily relying on traditional retrieval techniques like TFIDF and BM25 for product search. Such sparse methods usually require customers to type explicit queries that match the product information and mostly struggle to achieve good relevance for queries that are colloquial and implicit. In consequence, the search engine either returns no result or results with low relevance ignoring the existence of the relevant ones, which harms the customer experience and business metrics. For instance, Ebay is returning ""No exact matches found"" for the query ""What are the best gifts for boys under 5?"". Although the ""Results matching fewer words"" solution avoids the ""no result"" situation, its search relevance has got the obvious potential to be improved. People might argue that it's rare for such queries to occur. However, it's not uncommon that many opportunities and advancements are actually driven by the use cases that are underestimated in the beginning. LLM-based Solution Today, thanks to the fast development of LLMs, one can quickly build prototypes without worrying about the effort needed to build in-house solutions from scratch. This enables my quick discovery to tackle the problem. As depicted in the image below, the idea is pretty straightforward. The LLM is leveraged to translate the raw query to an enhanced query that aims to contain the explicit product information for search. Potentially, the product range covered in the enhanced query could be broad for the raw query that is implicit and fuzzy. 
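Before moving on to the embedding-based retrieval, here is a rough sketch of this query-enhancement step using a Hugging Face text-generation pipeline with Llama 2. The model id, prompt wording, and generation settings are illustrative assumptions, not necessarily those of the repository mentioned below:

```python
# Sketch of the query-enhancement step: Llama 2 expands a colloquial shopping
# query into explicit product names that can be searched for.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # gated model; access must be requested
    device_map="auto",
)

raw_query = "What are the best gifts for boys under 5?"
prompt = (
    "List 10 concrete product names that would be relevant search results "
    f"for the following shopping query:\n{raw_query}\n"
)

# Generate the enhanced query; sampling settings are illustrative.
enhanced_query = generator(prompt, max_new_tokens=200, do_sample=True)[0]["generated_text"]
print(enhanced_query)
```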
In consequence, sending the enhanced query directly to the keyword-based search engine will likely lead to poor results due to its ambiguity and uncertainty. As a solution, LLM embedding is adopted to address the semantic complexity. Specifically, the enhanced query is projected into the embedding space that contains the preprocessed product embeddings. Next, the product retrieval is done by comparing the similarity between the query embedding and product embeddings, which then generates the top-k products as search results. There is a wide range of techniques to implement the idea as there exist many options for each step. Here, I provide one example implementation based on Hugging Face and LangChain. The actual code is hosted on the Github repo below, with the details explained as follows. Generate the enhanced query First, the recently announced Llama 2 is adopted as the LLM to generate the enhanced query for a given raw query. As demonstrated below, the Hugging Face pipeline is used, considering its simplicity. It's worth noting that the pipeline itself is enough to accomplish the task so the use of LangChain is totally optional. The prompt template adopted here aims to generate relevant and diverse product names to address the fuzziness of the raw query. Create product embeddings Next, the sentence transformer and FAISS in LangChain are used to create and store the product embeddings based on the product titles in the inventory. Here, due to the lack of access to actual search engines, the offline Ebay product dataset ""products.csv"" is adopted as the mockup of the E-commerce product inventory. This dataset contains approximately 3,000 products covering a wide range of categories. Product retrieval When it comes to retrieval, the same sentence transformer model that encodes the products is used again to generate the query embedding for the enhanced query. Finally, the top-10 products are retrieved based on the similarity between the query embedding and product embeddings. Showcase To demonstrate the effectiveness of this approach, let's look at the above-mentioned query ""What are the best gifts for boys under 5?"" and compare the LLM enhancement with the original Ebay search results presented in Figure 1. First, after receiving the raw query, Llama 2 generates 10 products as instructed by the prompt template. They look pretty impressive for boys' gift ideas although a better product-level granularity is expected. Next, let's have a look at the similarity match in the embedding space. What are retrieved from the product inventory mockup are not bad at all in comparison with the results of the real-world Ebay search engine in Figure 1. Due to the limited product range of the inventory mockup, the comparison is somewhat unfair but we are still able to observe the significant difference before and after applying LLM. Overall, the retrieval in embedding space achieves both relevance and diversity. Final thoughts After conducting the initial discovery, it is obvious that LLMs are a powerful tool to enhance the product search of E-commerce platforms. For this task, there are many future explorations to conduct, including prompt engineering for generating queries, product embeddings with enriched attributes, online latency optimization for LLM query enhancement, etc. Hope this blog could inspire the E-commerce platforms that need solutions to improve product search. References [1] Nayak, P. (2019) Understanding searches better than ever before, Google. 
Available at: https://blog.google/products/search/search-language-understanding-bert/ (Accessed: 09 August 2023).[2] Muhamed, A. et al. (no date) Web-scale semantic product search with large language models, Amazon Science. Available at: https://www.amazon.science/publications/web-scale-semantic-product-search-with-large-language-models (Accessed: 09 August 2023).",https://pub.towardsai.net/enhancing-e-commerce-product-search-using-llms-30d5a2117f71#e5f3,towards_ai -Exploring Large Language Models -Part 3,"Fine Tuning on Custom Domain Data All the popular models like GPT3/3.4/4 and LLAMA2 are trained primarily on the data scraped from the internet. Common Crawl, WebText, GitHub, StackOverflow etc: These are massive datasets of text and code that are crawled from the public web and a few curated like the QA dataset SQAD. The worldview and information the model has learned are also based on this data. However, this means that if we have some domain-specific data that the model has not seen, then it won't be able on its own to answer questions related to such data in case of Closed Book QA use-case or any other use case that depends on the specific domain data. For example, most online portals are adding virtual assistants for their customers, banks, e-commerce, customer support etc. And a huge if not the majority of data in the world still lives outside of the internet in enterprises. We have seen in Part 2 how LLMs can help address information retrieval use cases based on Vector space embeddings. But what if our use case is more high level? It needs domain ""understanding"", maybe some higher level reasoning tasks. This is where fine-tuning with custom data comes into play. I am not able to provide a use case where higher-level reasoning can be used. There are a few simpler ones, like training on custom issues and then asking it to reason on similar issues and possible solutions, but these are as of now not tested. So let's stick with a simpler use-case Closed-Book QA - the model answers questions based on the knowledge it internally has for now. The above is from a 2021 paper Can Generative Pre-trained Language Models Serve as Knowledge Bases for Closed-book QA? This is already outdated in the sense of the number and size of models and training released. The authors with 2021 models could not achieve great results and the great results they found in some studies described could be attributed to the high train and test overlap in datasets. There are also a lot of tutorials on the internet that try to portray this concept with toy datasets. The real trouble is making the model 'understand' the data first and not just parrot it out. Without understanding, it will parrot out the answer based on the similarity of the question in the training set, or both the question and answer. To prevent this, the authors have an intermediate step called 'Recite' where the model is made to recite/output the relevant passages and, after that, output the answer. Just to be clear, there is no doubt now (2023), especially with GPT3/4, LLAMA2 and similar models about the feasibility of this use case, that a model can understand the question, has some ability for causal reasoning, and can generalize to learn a world model from its training data, and to use both to create a well-formed answer to the question. Let's see the difficulties one by one however, of training a large model. First is the importance of the model size. This GIF from the Google AI blog illustrates this beautifully. 
It is relatively easy and cost-efficient to train or fine-tune a small model with our custom data, as the GPU and infrastructure requirements are modest. By contrast, loading and fine-tuning very large language models (without quantisation) needs huge fleets of GPUs and distributed training infrastructure (e.g., libraries like DeepSpeed). LLMs come in various sizes, based on the number of trainable parameters or weights. The smaller ones, with fewer than 1 billion parameters (GPT2 124M, Bloom 560M, Flan-T5 783M, etc.), can be trained on a laptop GPU with 8 to 15 GB of GPU RAM. For quite some time, this is what I tried. I tried to overfit a small test data set on decoder models like GPT2-small, GPT2-medium, and Bloom, and on encoder-decoder models like Flan-T5, thinking that the understanding we see in ChatGPT (see unsupervised learning, Part 1) might emerge in some form if we trained these smaller models (less than one billion parameters). As per the paper, I tried both causal training, where the model is presented with only previous tokens, and masked-LM training, where the model is presented with the full tokens but a certain percentage of tokens are masked at random and the model has to predict them. The next option was to fine-tune a large model with the data. However, this is extremely difficult to do, and even if cloud-based solutions are used, it would be pretty expensive. (What OpenAI provides now is Instruct Fine-Tuning, which we will cover later.) It takes months of GPU fleet time and a specialized library and infrastructure to distribute training across the multiple GPUs needed to train LLMs. For example, even a relatively small model like the BigScience Bloom 3 Billion model, with its weights loaded in 16-bit, cannot be trained on an A100 in Colab Pro with 40 GB of GPU RAM (the highest you can get), as it goes out of memory. Solution - Fine-Tuning Large Models via Quantisation and Parameter Efficient Tuning The solution is to reduce the size of the models so that they fit a commodity GPU, and then fine-tune them. There are two parts to this: Quantisation and Parameter Efficient Tuning. The real magic is that a laptop with a sufficiently recent GPU (having Tensor Cores) can run the 7 billion parameter Llama2 pre-trained model recently open-sourced by Meta Research. Imagine the compressed knowledge and an NLU (Natural Language Understanding) model running on your local laptop. This is still a smallish model, but it is capable of understanding and has sufficient world knowledge embedded in it to be quite useful. Imagine what a model like this, or better models in the future, could do if it could run on small servers or in cars and leverage its causal reasoning and world-model knowledge to supervise lower-level/specialist AI/ML systems. So we now have a way to fit reasonably large models (7B or more) on a single GPU via Quantisation, and then train them in a parameter-efficient way via LoRA/QLoRA. Take 1: Unsupervised Training Fine-tuning with QLoRa Using the small training data and QLoRA, I first tried to train a large 7B Llama2 model by feeding in the training text as is (causal LM training via unsupervised learning). Note that this model was loaded in 4-bit, making it runnable on a single T4 GPU, and trained with QLoRA. With QLoRA, only a fraction of the adapter weights are trained and summed with the existing frozen pre-trained weights of the model during inference. Here is an illustrative Colab notebook. 
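For reference, a minimal sketch of such a 4-bit QLoRA setup with the transformers, bitsandbytes, and peft libraries might look like the following; the hyperparameters and target modules are illustrative assumptions, not necessarily the values used in the notebook:

```python
# Minimal QLoRA-style setup: load Llama2 7B in 4-bit and attach trainable LoRA
# adapters. Hyperparameters are illustrative; the referenced notebook may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"

# 4-bit NF4 quantization; float16 compute keeps it runnable on a T4.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Freeze the quantized base weights and add small low-rank adapter matrices.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```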
You can see that training the model with just the text as is, does not result in proper output to questions. The answers are not affected by the training data. Take 2: Instruct Fine-tuning with QLoRa Instruction Tuning concept is a higher-level training concept introduced by this paper FineTuned Language Models Are Zero shot Learners (FLAN) We leverage the intuition that NLP tasks can be described via natural language instructions, such as ""Is the sentiment of this movie review positive or negative?"" or ""Translate 'how are you' into Chinese."" We take a pre-trained language model of 137B parameters and perform instruction tuning ... Since we use QLoRa we are effectively closely following this paper - QLORA: Efficient Finetuning of Quantized LLMs concerning the training data set, the format that the authors used to train their Gauanco model This is the format for the Llama2 model and will be different for others. One of the hardest problems of training is finding or creating a good quality data set to train. In our case, converting the available training data set to the instruction data set. Since our use case is Closed Book QA, we need to convert this to a QA format. Using older NLP methods like NER (Named Entity Recognition) and then using that to create a QA dataset was not effective. This is where the Self-instruct concept could be used However previous to Llama2, the best-performing model was the GPT 3/4 model via ChatGPT or its API and using these models to do the same was expensive. The 7 billion model of Llama2 has sufficient NLU (Natural Language Understanding) to create output based on a particular format. Running this in 4-bit mode via Quantisation makes it feasible compute-wise to run this on a large data set and convert it to a QA dataset. This was the prompt used. The context was a sliding window from the text dataset. Some minimal parsing and finetuning were done on the output of the model, and we could generate a QA dataset of the format below. This was fed to the QLoRA-based fine-tuning (Colab Notebook). We can see that the output from a fine-tuned 4-bit quantized llama2 7 B model is pretty good. Colab Notebook Trying to reduce hallucination via fine-tuning In the generated dataset, I added a specific tag `Source:8989REF`. The idea was that via attention, this token will be somehow associated with the text that we were training on. And then to use this hash somehow to tweak the prompt to control hallucination. Something like ""[INST] <>\nYou are a helpful Question Answering Assistant. Please only answer from this reference Source:8989REF"" However, that turned out to be a very naive attempt. Also, note that the generated QA missed transforming training data related to Professor Thiersch's method to a proper QA dataset. These and other improvements need to be experimented with, as well as to train with some completely new data that the model has not seen to test more effectively. Update: Training with new data was done by writing an imaginary story with ChatGPT help and then creating an instruction tuning data set (colab notebook). The model was then trained and tested (colab notebook) with this generated instruct dataset. The results confirm that the model learns via Instruct tuning, not only the fed questions but other details and relations of the domain. Problems with hallucinations remain (Bordor, Lila characters who are not in the story). The LLama2 13B 4-bit fine-tuned model has better output than the 7B model. A lot more needs to be explored in Fine-tuning. 
One observation is that slight changes in prompts give different answers. 
Since the output is not deterministic (that is, with even the same prompt, it varies over time), it is all the more difficult to fine-tune prompts to give the most effective output. This needs to be studied more. Also to be updated are higher level use-cases that should be possible with the fine-tuned models.",https://pub.towardsai.net/exploring-large-language-models-part-3-ab60ee236950#d193,towards_ai -Fine-Tuning a Llama-2 7B Model for Python Code Generation,"New Llama-2 model In mid-July, Meta released its new family of pre-trained and finetuned models called Llama-2, with an open source and commercial character to facilitate its use and expansion. The base model was released with a chat version and sizes 7B, 13B, and 70B. Together with the models, the corresponding papers were published describing their characteristics and relevant points of the learning process, which provide very interesting information on the subject. For pre-training, 40% more tokens were used, reaching 2T, the context length was doubled and the grouped-query attention (GQA) technique was applied to speed up inference on the heavier 70B model. On the standard transformer architecture, RMSNorm normalization, SwiGLU activation, and rotatory positional embedding are used, the context length reaches 4096 tokens, and an Adam optimizer is applied with a cosine learning rate schedule, a weight decay of 0.1 and gradient clipping. The dataset for tuning For our tuning process, we will take a dataset containing about 18,000 examples where the model is asked to build a Python code that solves a given task. This is an extraction of the original dataset [2], where only the Python language examples are selected. Each row contains the description of the task to be solved, an example of data input to the task if applicable, and the generated code fragment that solves the task is provided [3]. Creating the prompt To carry out an instruction fine-tuning, we must transform each one of our data examples as if it were an instruction, outlining its main sections as follows: Output: Fine-tuning the model To carry out this stage, we have used the Google Colab environment, where we have developed a notebook that allows us to run the training in an interactive way and also a Python script to run the training in unattended mode. For the first test runs, a T4 instance with a high RAM capacity is enough, but when it comes to running the whole dataset and epochs, we have opted to use an A100 instance in order to speed up the training and ensure that its execution time is reasonable. In order to be able to share the model, we will log in to the Huggingface hub using the appropriate token, so that at the end of the whole process, we will upload the model files so that they can be shared with the rest of the users. Fine-tuning techniques: PEFT, Lora, and QLora In recent months, some papers have appeared showing how PEFT techniques can be used to train large language models with a drastic reduction of RAM requirements and consequently allowing fine-tuning of these models on a single GPU of reasonable size. The usual steps to train an LLM consist, first, an intensive pre-training on billions or trillions of tokens to obtain a foundation model, and then a fine-tuning is performed on this model to specialize it on a downstream task. In this fine-tuning phase is where the PEFT technique has its purpose. 
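Returning for a moment to the prompt-construction step described above, turning a dataset row into an instruction-style training text could look roughly like this; the section headers below are assumptions based on common instruction-tuning formats, not necessarily the article's exact template:

```python
# Illustrative prompt construction for one training example; the actual section
# headers used in the article may differ.
def build_prompt(example: dict) -> str:
    """Turn a dataset row (task description, optional input, generated code)
    into a single instruction-style training text for the Llama-2 fine-tuning."""
    prompt = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get("input"):
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Output:\n{example['output']}"
    return prompt

sample = {
    "instruction": "Write a Python function that returns the n-th Fibonacci number.",
    "input": "",
    "output": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
}
print(build_prompt(sample))
```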
Parameter Efficient Fine-Tuning (PEFT) allows us to considerably reduce RAM and storage requirements by only fine-tuning a small number of additional parameters, with virtually all model parameters remaining frozen. PEFT has been found to produce good generalization with relatively low-volume datasets. Furthermore, it enhances the reusability and portability of the model, as the small checkpoints obtained can be easily added to the base model, and the base model can be easily fine-tuned and reused in multiple scenarios by adding the PEFT parameters. Finally, since the base model is not adjusted, all the knowledge acquired in the pre-training phase is preserved, thus avoiding catastrophic forgetting. Most widely used PEFT techniques aim to keep the pre-trained base model untouched and add new layers or parameters on top of it. These layers are called ""Adapters"" and the technique of their adjustment ""adapter-tuning"", we add these layers to the pre-trained base model and only train the parameters of these new layers. However, a serious problem with this approach is that these layers lead to increased latency in the inference phase, which makes the process inefficient in many scenarios.In the LoRa technique, a Low-Rank Adaptation of Large Language Models, the idea is not to include new layers but to add values to the parameters in a way that avoids this scary problem of latency in the inference phase. LoRa trains and stores the changes of the additional weights while freezing all the weights of the pre-trained model. Therefore, we train a new weights matrix with the changes in the pre-trained model matrix, and this new matrix is decomposed into 2 Low-rank matrices as explained here: Merge the base model and the adapter weights As we mention, we have trained ""modification weights"" on the base model, our final model requires merging the pretrained model and the adapters in a single model. You can find and download the model in my Hugging Face account edumunozsala/llama-27b-int4-python-code-20k. Give it a try! Inferencing or generating Python code And finally, we will show you how you can download the model from the Hugging Face Hub and call the model to generate an accurate result: Thanks to Maxime Labonne for an excellent article [9] and Philipp Schmid who provides an inspiring code [8]. Their articles are a must-read for everyone interested in Llama 2 and model fine-tuning. And it is all I have to mention, I hope you find useful this article and claps are welcome!! You can Follow me and Subscribe to my articles, or even connect to me via Linkedin. The code is available in my Github Repository. References [1] Llama-2 paper [2] Link to the original dataset in the Huggingface hub [3] Link to the used dataset in the Huggingface hub [4] Fine-tuning a GPT - LoRA by Chris Kuo/Dr. Dataman [5] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, & Weizhu Chen. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [6]. QLoRa: Efficient Finetuning of QuantizedLLMs [7] Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [8] Extended Guide: Instruction-tune Llama 2 by Philipp Schmid. [9] Fine-Tune Your Own Llama 2 Model in a Colab Notebook by Maxime Labonne [10]. 
My Github Repository",https://pub.towardsai.net/fine-tuning-a-llama-2-7b-model-for-python-code-generation-865453afdf73#bf4e,towards_ai -Foundation Models: Scaling Large Language Models,"New Moore's Laws Achieving Zettascale Computing As the traditional Moore's Law reaches its twilight, new laws are emerging to define the evolution of computing performance. The exponential growth of GPU performance and supercomputer systems has accelerated AI's advancements, with LLMs as a prime example. Despite their extensive training times, these LLMs are benefiting from the rapid growth of computational power. Moore's Law, famously predicting the number of transistors on a microchip would double approximately every two years, is now being replaced by new performance-based laws: GPU performance doubles every 2.2 years, while supercomputer system performance doubles every 1.2 years. These advancements are shaping the way AI/ML technologies progress. Despite the rapidly increasing performance, the training of LLMs still takes anywhere from days to months. This extended duration speaks to the complexity and vast potential of these models. As computational power continues to soar, it will unlock new possibilities for AI research and development. In the coming decade, the AI landscape is set to enter the era of Zettascale Computing. As a result of the new Moore's Laws, AI performance is expected to dramatically outpace other computing advancements. This shift to Zettascale Computing will provide unprecedented processing capabilities, enabling further breakthroughs in AI and other domains. The new Moore's Laws, focusing on GPU and supercomputer performance, herald a new era for AI research and development. With the advent of Zettascale Computing, we can expect even more rapid growth in AI capabilities, impacting various industries and shaping the future of technology. Generative AI Journey with State-of-the-Art LLMs Generative AI (GAI) has experienced rapid advancements in text-to-images, videos, and 3D. But ChatGPT and GPT-4 took the world by storm. These are LLMs-based GAI like other state-of-the-art LLMs: Claude, Bard, LLaMA, ToolFormer, Google USM, PaLM, NeMo, Databricks Dolly, etc. These have revolutionized NLP, enabling a myriad of applications once thought to be unattainable. Despite their impressive capabilities and increasing computing power, LLMs face common challenges such as scalability, training efficiency, and the need for high-quality training data. It reportedly required over 3 million GPU hours across 3072 GPUs to train GPT-3's 175 billion parameters over a period of several months. To address these common challenges, foundation models are emerging as a potential solution. These models aim to provide a solid base for AI development, enabling researchers and developers to build upon them and adapt them for various tasks more efficiently. By focusing on foundation models, the AI community can tackle the limitations posed by scalability, performance, training efficiency, and data quality, ultimately unlocking the full potential of LLMs and other large-scale models (LSMs) in diverse applications. The Era of Foundation Models Foundation models are pre-trained AI models serving as a basis for building diverse applications and tasks. Designed to be versatile, adaptable, and robust, they offer strong leverage across a wide range of use cases. The concept of foundation models was introduced by Stanford with two significant points: emergence and homogenization. 
Emergence: Referring to the implicit induction of a system's behavior, emergence is a source of both scientific excitement and concern regarding unforeseen consequences. Foundation models learn from vast amounts of data, developing intricate patterns and relationships that can exhibit surprising behaviors.Homogenization: Foundation models consolidate methodologies for building ML systems across various applications. While this homogenization provides strong leverage for many tasks, it also creates single points of failure, raising concerns about resilience and reliability. The astounding success of GAI and human-like ChatGPT has ushered in a new era of foundation models, laying the groundwork for large-scale models and the rise of artificial general intelligence (AGI). Foundation models have emerged to transform the digital world. Their impact is comparable to other milestones in digital evolution, such as the invention of electricity, the advent of the internet, and the rise of cloud computing. By bridging the gap between narrow AI and AGI, foundation models are shaping the future of AI research and development, opening up new possibilities and opportunities in the rapidly evolving digital landscape. Key Characteristics of Foundation Models Foundation models have rapidly become the core of AI. They share several key characteristics, highlighting their potential and significance in shaping the future of AI. Pre-trained and Adaptable: A defining characteristic of foundation models is their pre-trained nature, allowing them to serve as a starting point for various applications and tasks. Through transfer learning and fine-tuning, these models can be adapted to address specific challenges and requirements, significantly reducing development time and resources.Scalability: Designed to be scalable, foundation models can handle vast amounts of data and grow in complexity as required. This scalability enables them to tackle a broad range of tasks and accommodate the ever-increasing demands of the AI landscape.Versatility: Foundation models boast remarkable versatility, as they can be employed across multiple domains and industries. From language and vision to healthcare and finance, these models serve as a basis for a wide range of applications.Self-Supervised Learning: A key aspect of foundation models is their ability to utilize self-supervised learning techniques. By leveraging large-scale, unlabeled data, these models can learn complex representations and features, greatly improving their performance on various tasks and reducing dependence on labeled data.Robustness: Foundation models are known for their robustness, demonstrating resilience in the face of noisy, incomplete, or even adversarial data. This robustness allows them to maintain high levels of performance and accuracy across different contexts and challenges.Interoperability: Interoperability is another critical characteristic of foundation models, as they can be easily integrated with existing systems and frameworks. This seamless integration facilitates collaboration between different AI models and components, streamlining the development process and fostering innovation.Generalization: The ability to generalize is a hallmark of foundation models, enabling them to perform well on unseen data and novel tasks. This characteristic allows them to adapt to a variety of challenges, making them an invaluable asset in AI research and development. 
By understanding the key characteristics of foundation models, such as their pre-trained nature, adaptability, scalability, versatility, self-supervised learning capabilities, robustness, interoperability, and generalization, we can better appreciate their potential and impact on the future of AI. Capabilities of Foundation Models Beyond LLMs Foundation models have made a significant impact beyond LLMs, offering a versatile and powerful approach to solving complex problems across various domains in language, vision, robotics, reasoning and search, interaction, and philosophy of understanding. Language: Foundation models excel in language, demonstrating human-like comprehension and generation of text. From machine translation and sentiment analysis to summarization and question-answering, these models are unlocking new possibilities in language-related applications and enhancing communication between humans and machines.Vision: In the realm of computer vision (CV), foundation models are transforming the way we analyze and interpret visual data. By effectively recognizing objects, detecting patterns, and segmenting images, these models are enabling advancements in fields such as autonomous vehicles, medical imaging, and surveillance systems.Robotics: By incorporating self-supervised learning and reinforcement learning techniques, foundation models are empowering robots to learn from their environments, adapt to new tasks, and interact more effectively with humans.Reasoning and Search: Foundation models are enhancing our ability to reason and search through vast amounts of data, extracting valuable insights and uncovering hidden connections. Their capabilities extend to logical reasoning, pattern recognition, and knowledge graph exploration, enabling more informed decision-making and efficient problem-solving across numerous industries.Interaction: The interactive capabilities of foundation models facilitate more natural and intuitive communication between humans and machines. By understanding and generating human-like responses, these models pave the way for seamless collaboration and improved user experiences in applications such as chatbots, virtual assistants, and customer support systems.Philosophy of Understanding: At the core of foundation models lies the philosophy of understanding, aiming to uncover the underlying principles and mechanisms that enable machines to comprehend and interpret complex data. The capabilities of foundation models span across language, vision, robotics, reasoning and search, interaction, and philosophy of understanding, highlighting their potential to reshape the AI landscape. By exploring these capabilities, we can foster responsible innovation and unlock the full potential of foundation models in addressing the world's most pressing challenges. AI Engineering AI engineering is a burgeoning discipline combining software engineering principles with AI techniques to design, build, and scale intelligent systems. As large-scale foundation models continue to revolutionize the AI landscape, AI engineering plays a pivotal role in their development and deployment. AI engineering offers the tools and techniques necessary to scale out large-scale models while maintaining their performance and adaptability. 
Some aspects of scaling out these models through AI engineering include: Distributed Training: AI engineers harness the power of distributed computing to train large-scale models on vast amounts of data, accelerating the training process and improving model performance.Data Management: AI engineers ensure that the data used for training and fine-tuning foundation models is well-organized, clean, and representative of the target domain.Resource Management: AI engineers optimize the use of computational resources, such as GPUs and TPUs, ensuring that large-scale models can be trained and deployed efficiently and cost-effectively.Model Compression and Pruning: AI engineers employ model compression and pruning techniques to reduce the size and complexity of large-scale models, making them more accessible and deployable across various platforms.Monitoring and Maintenance: AI engineers continuously monitor the performance of large-scale models, identifying potential issues and implementing necessary updates and improvements to ensure their ongoing success. AI engineering is an essential discipline for building and scaling foundation models, providing the necessary expertise and techniques to ensure their robustness, efficiency, and adaptability. As we continue to push AI boundaries, AI engineering will play a crucial role in unlocking the full potential of foundation models and shaping the future of AI research and development. TL;DR In closing, foundation models represent a critical milestone in the advancement of AI, providing a versatile and adaptable approach to solving complex problems across multiple domains. From language and vision to robotics and reasoning, these models are unlocking new possibilities and driving innovation across various industries. As we continue to explore the full potential of foundation models and their role in the evolution towards AGI, it is crucial to foster responsible and ethical AI development, ensuring these models are used to benefit humanity and address the most pressing challenges of our time. With foundation models as a solid basis, we can accelerate AI research and development, unlocking new frontiers and shaping the future of intelligent systems. LLMs Papers GPT-4 Technical Report: https://arxiv.org/abs/2303.08774GPT-3: Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165Toolformer: Language Models Can Teach Themselves to Use Tools: https://arxiv.org/abs/2302.04761LLaMA: Open and Efficient Foundation Language Models: https://arxiv.org/abs/2302.13971Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages: https://arxiv.org/abs/2303.01037Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model: https://arxiv.org/abs/2201.11990 Foundation Models Resources Reflections on Foundation Models: https://hai.stanford.edu/news/reflections-foundation-modelsOn the Opportunities and Risks of Foundation Models: https://arxiv.org/abs/2108.07258",https://pub.towardsai.net/foundation-models-37074a2d70a1#7ebc,towards_ai -GPTQ Quantization on a Llama 2 7B Fine-Tuned Model With HuggingFace,"GPTQ: Post-training quantization on generative models In a groundbreaking paper [1], researchers unveiled GPTQ, a novel post-training quantization method that has the potential to reshape the world of language model compression. 
GPTQ is not only efficient enough to be applied to models boasting hundreds of billions of parameters, but it can also achieve remarkable precision by compressing these models to a mere 2, 3, or 4 bits per parameter without sacrificing significant accuracy. This cutting-edge technique is showcased by its ability to quantize massive models, such as OPT-175B and BLOOM-176B, in just a matter of a few GPU hours while maintaining minimal perplexity, a stringent measure of accuracy. On the practical front, the researchers have developed an execution harness that enables efficient operation of the compressed models for generative tasks. Remarkably, they achieved the milestone of running the compressed OPT-175B model on a single NVIDIA A100 GPU, or with only two more cost-effective NVIDIA A6000 GPUs. Additionally, bespoke GPU kernels optimized for compression result in significant speedups, further enhancing the practicality of these compressed models. What makes GPTQ stand out is its ability to quantize language models with hundreds of billions of parameters to the 3-4 bits/component range. This is a remarkable leap, as prior methods struggled to maintain accuracy below 8 bits and typically focused on smaller models. However, the study also highlights the complex tradeoffs between perplexity, bit-width, and model size induced by compression. But it comes with limitations. GPTQ does not currently offer speedups for actual multiplications due to the lack of hardware support for mixed-precision operands on mainstream architectures. Activation quantization is also not included in the current results but can be addressed through orthogonal techniques. In sum, GPTQ's ability to compress extremely large language models to unprecedented bit-widths with minimal loss of accuracy marks a significant milestone in the field of machine learning and language modeling. It paves the way for more efficient and accessible applications of these colossal models while pointing toward further research possibilities in the realm of model compression. When should you use GPTQ? The answer will depend on each specific case and on the base model to be used, but an approach that is being applied to numerous models, and that is recommended by Hugging Face and the article I mentioned before, is the following: Fine-tune the original LLM with bitsandbytes in 4-bit (nf4) and QLoRA for efficient fine-tuning. Merge the adapter into the original model. Quantize the resulting model with GPTQ 4-bit. I ran the first two steps in my previous article [3] (a brief sketch of the adapter-merge step is shown below), and now that the AutoGPTQ library is integrated with the Hugging Face ecosystem, we will execute the third step in an extremely simple way. AutoGPTQ integrated with Hugging Face transformers The AutoGPTQ library emerges as a powerful tool for quantizing Transformer models, employing the efficient GPTQ method. Some efforts, like GPTQ-for-LLaMa, Exllama, and llama.cpp, focus on quantizing the Llama architecture, but AutoGPTQ distinguishes itself by offering seamless support for a diverse array of transformer architectures. The Hugging Face team has taken a significant step to enhance accessibility to GPTQ, and they have integrated an inclusive Transformers API, simplifying the process of Large Language Model (LLM) quantization for a wider audience. This integration includes essential optimization options, such as CUDA kernels, catering to common use cases. 
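Before going further, here is the adapter-merge sketch promised above: a rough illustration of the first two steps of the recipe (QLoRA fine-tuning, then merging the adapter back into the base model) using the PEFT library. The fine-tuning itself is covered in [3]; the repository names below are placeholders rather than the article's actual ones.

```python
# Sketch of step 2 of the recipe: merge a trained QLoRA adapter into the base model.
# Placeholder names: 'base_model_id' and 'adapter_id' are illustrative, not the real repos.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = 'meta-llama/Llama-2-7b-hf'                # base model used for fine-tuning
adapter_id = 'your-user/llama-2-7b-python-coder-adapter'  # hypothetical QLoRA adapter repo

# Load the base model in half precision; merging needs the unquantized weights
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map='auto'
)

# Attach the LoRA adapter and fold its weights into the base model
model = PeftModel.from_pretrained(base_model, adapter_id)
merged_model = model.merge_and_unload()

# Save the merged model (and tokenizer) so it can be quantized with GPTQ in step 3
merged_model.save_pretrained('llama-2-7b-python-coder-merged')
AutoTokenizer.from_pretrained(base_model_id).save_pretrained('llama-2-7b-python-coder-merged')
```

The merged, full-precision checkpoint is what gets passed to the GPTQ quantizer in the next step.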
For users seeking more advanced quantization options, the Auto-GPTQ library remains a valuable resource, offering capabilities like Triton kernels and fused-attention compatibility, and ensuring versatility and adaptability in the world of transformer model quantization. Extracted from the Hugging Face blog article ""Making LLMs lighter with AutoGPTQ and transformers"" [5]. Our approach to this task First, we will load our fine-tuned Llama 2 7B 4-bit Python coder model in a Colab session using a T4 with extra RAM. The model is loaded in 4-bit with bitsandbytes, and then we execute about 12 examples to measure the inference time. In order to perform a simple evaluation of performance at inference time, we have taken as examples those whose input text was longer than 500 characters; in this way, we can better appreciate the impact of quantization during inference. You can find the code to load this model in the model description on the Hugging Face Hub. In my notebook, we describe how to perform inference on the examples mentioned. Quantize the model using auto-gptq, transformers, and optimum GPTQ quantization consumes a lot of GPU VRAM; for that reason, we need to execute it on an A100 GPU in Colab. It takes about 45 minutes to quantize the model, less than $1 in Colab. You can find the code in this notebook in my repository. First, we need to install the libraries as recommended in the Hugging Face tutorial: The Optimum library, Hugging Face's toolkit for training and inference optimization, provides the integration of AutoGPTQ into Transformers. The GPTQ algorithm requires calibrating the quantized weights of the model by making inferences on the quantized model. For quantizing a model using auto-gptq, we need to pass a dataset to the quantizer. This can be achieved either by passing a supported default dataset among ['wikitext2','c4','c4-new','ptb','ptb-new'] or a list of strings that will be used as your custom dataset. Now you just need to load the model using a GPTQ configuration with the desired parameters; as usual when working with transformers, it is very easy: As mentioned, this code takes about 45 minutes to run and consumes a peak of 32 GB of GPU VRAM. ""You will need a GPU to quantize a model. We will put the model in the CPU and move the modules back and forth to the GPU in order to quantize them. If you want to maximize your GPUs usage while using CPU offload, you can set device_map = ""auto"" "" [6], Hugging Face docs. The parameters are self-explanatory: 4-bit quantization, the C4 dataset, and the tokenizer to use during quantization. The other two parameters take their default values: group_size: The group size to use for quantization. The recommended value is 128, and -1 uses per-column quantization. desc_act: Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly speed up inference, but the perplexity may become slightly worse. Also known as act-order. Once you have your model quantized, it is time to upload it to the Hugging Face Hub and share it with the community. In my experiment using GPTQ, the reduction in model size is striking. My fine-tuned Llama 2 7B model weighed 13.5 GB on disk, but after 4-bit quantization, its size was dramatically reduced to just 3.9 GB, less than a third of the original size. This feature is very attractive when deploying large language models. 
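To make that configuration more concrete, here is a minimal sketch of the quantization step using the transformers, optimum, and auto-gptq integration. The repository name is a placeholder for the merged fine-tuned model, and the arguments simply mirror the parameters discussed above; treat it as a sketch under those assumptions rather than the exact notebook code.

```python
# Sketch: GPTQ 4-bit quantization via the transformers + optimum + auto-gptq integration.
# pip install transformers optimum accelerate auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = 'llama-2-7b-python-coder-merged'  # placeholder: the merged fine-tuned model

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ calibrated on the C4 dataset, with the parameters described above
gptq_config = GPTQConfig(
    bits=4,
    dataset='c4',
    tokenizer=tokenizer,
    group_size=128,   # recommended default; -1 would mean per-column quantization
    desc_act=False,   # act-order off: faster inference, slightly worse perplexity
)

# Quantization runs while the model loads; expect a long run on a large GPU (an A100 in the article)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=gptq_config, device_map='auto'
)

# Save the quantized model and tokenizer; they can then be pushed to the Hugging Face Hub
quantized_model.save_pretrained('llama-2-7b-int4-python-coder-gptq')
tokenizer.save_pretrained('llama-2-7b-int4-python-coder-gptq')
```

Loading the quantized checkpoint back for inference is then an ordinary from_pretrained call, as the next section shows.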
Loading the GPTQ Model from Hugging Face Hub and making some inferences Probably, all of you know how to do that, but just in case you think this could be ""trickier"" than with other models, we will show you that it is business as usual. Remember that you need to load all the libraries, including optimum, accelerate, and, of course, auto-gptq. Then you can load the tokenizer and the model into your notebook on a T4 GPU in Google Colab: Now we can check our GPU to confirm how much memory we are consuming and, indeed, we can see that the model occupies about 5 GB. We repeat the performance evaluation we mentioned earlier, making inferences on a bunch of long examples to compare with the original model. Both inference processes were executed on a T4 GPU: the base model took about 17 to 19 seconds per inference, while the quantized model ran in about 8 to 9 seconds per inference, roughly half the time. All code and examples are well explained in the notebook in my repository. Any suggestions or bug fixes are welcome. References [1] ICLR 2023 paper ""GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers"" [2] ""GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs - Examples with Llama 2"" by Benjamin Marie. [3] ""Fine-Tuning a Llama 2 7B Model for Python Code Generation"" by Eduardo Muñoz. [4] Original fine-tuned model on the Hugging Face Hub: ""edumunozsala/llama-27b-int4-python-code-20k"". [5] Hugging Face blog article ""Making LLMs lighter with AutoGPTQ and transformers"". [6] Hugging Face official documentation about GPTQConfig. [7] ""4-bit Quantization with GPTQ"" by Maxime Labonne.",https://pub.towardsai.net/gptq-quantization-on-a-llama-2-7b-fine-tuned-model-with-huggingface-a7b291fbb871#34d2,towards_ai -LLaMA by Meta leaked by an anonymous forum: Questions Arises on Meta,"LLaMA: Meta's new AI tool According to the official release, LLaMA is a foundational language model developed to assist 'researchers and academics' in their work (as opposed to the average web user) to understand and study these NLP models. Leveraging AI in such a way could give researchers an edge in terms of time spent. You may not know this, but this is Meta's third LLM after Blender Bot 3 and Galactica. However, both earlier LLMs were shut down soon after release, and Meta stopped their further development, as they produced erroneous results. Before moving further, it is important to emphasize that LLaMA is NOT a chatbot like ChatGPT. As I mentioned before, it is a 'research tool' for researchers. We can expect the initial versions of LLaMA to be a bit more technical and indirect to use, as opposed to ChatGPT, which is very direct, interactive, and a lot easier to use. ""Smaller, more performant models such as LLaMA enable ... research community who don't have access to large amounts of infrastructure to study these models.. further democratizing access in this important, fast-changing field,"" said Meta in its official blog. Meta's effort to ""democratize"" access could shed light on one of the critical issues of Generative AI - toxicity and bias. ChatGPT and other LLMs (obviously, I am referring to Bing) have a track record of responding in a way that is toxic and, well... evil. The Verge and major critics have covered it in much detail. Oh, and the community did get access, but not in the way Meta anticipated. On March 3rd, a downloadable torrent of the LLaMA system was posted on 4chan. 
4chan is an anonymous online forum known for its controversial content and diverse range of discussions, which has nearly 222 million unique monthly visitors. LLaMA is currently not in use on any of Meta's products. But Meta has plans to make it available to researchers before they can use them in their own products. It's worth mentioning that Meta did not release LLaMA as a public chatbot. LLaMA is more of an open-source package that can be accessed by trusted authorities upon request. Powerful LLMs: What to hope Whether to agree with Ladish's views or not is debatable. Personally, I feel open-sourcing AI models could only benefit the AI community to scrutinize the model and improve them for the better. What do you think? After all, one of LLaMA's major goals is to 'democratize' access to such models. But this access in the form of a leak put Meta into question - how it handles its tools and conducts release in public? Most of the users that got the leaked copies soon discovered that LLaMA was not at all similar to ChatGPT. ""Downloading"" LLaMA is going to do very little for the average internet user because it's a ""raw"" AI system that needs a decent amount of technical expertise to get up and running. However, as I am writing this, Meta hasn't acknowledged the leak to the public yet. Neither did they comment on it. There are both positive and negative consequences to this leak. On the one hand, unrestricted access to Llama could help researchers understand how and why large language models work, which could lead to improvements in robustness, bias, and the toxic nature of LLMs. This could really help in reducing the potential for generating misinformation by these troublesome machines. On the other hand, however, the leak could lead to people misusing the model itself. It is not yet perfect. Hence Meta hasn't released it fully to the public yet. Risks such as spam and phishing could be really hard to tackle if such superintelligent machines are put to the test. Thus, much safeguard must be applied to the use of these models. We can see such tools, like OpenAI Text Classifier, emerging. So there is a positive hope for this. AI is exciting, no doubt. But a lot scarier if we lose our control over it.",https://pub.towardsai.net/llama-by-meta-leaked-by-an-anonymous-forum-questions-arises-on-meta-e1216e51db6#9001,towards_ai -LLaMA-GPT4All: Simplified Local ChatGPT,"Introduce GPT4All GPT4All is a large language model (LLM) chatbot developed by Nomic AI, the world's first information cartography company. It was fine-tuned from LLaMA 7B model, the leaked large language model from Meta (aka Facebook). GPT4All is trained on a massive dataset of text and code, and it can generate text, translate languages, write different kinds of creative content, and answer your questions in an informative way. GPT4All is available to the public on GitHub. LLaMA is available for commercial use under the GPL-3.0 license - while the LLaMA code is available for commercial use, the WEIGHTS are not. This effectively puts it in the same license class as GPT4All. Nomic is working on a GPT-J-based version of GPT4All with an open commercial license. GPT4All is not going to have a subscription fee ever. GPT4All is Free4All. Although GPT4All is still in its early stages, it has already left a notable mark on the AI landscape. Its popularity and capabilities are expected to expand further in the future. How to Run GPT4All Locally GPT4All Readme provides some details about its usage. 
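Before the step-by-step walkthrough below, here is a minimal sketch of the programmatic route via the nomic client. It reflects the early releases of the package, so the exact import path and method names are an assumption and may differ in newer versions.

```python
# Minimal sketch of driving GPT4All from Python via the nomic client (early API; may have changed).
# pip install nomic
from nomic.gpt4all import GPT4All

m = GPT4All()   # locates / downloads the local gpt4all-lora-quantized model
m.open()        # starts the local chat session

response = m.prompt('Write a short poem about running language models locally.')
print(response)
```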
Here we will briefly demonstrate how to run GPT4All locally on an M1 Mac CPU. Download gpt4all-lora-quantized.bin from the-eye. Clone this repository, navigate to chat, and place the downloaded file there. Simply run the following command for M1 Mac: Now, it's ready to run locally. Please see a few snapshots below: Similar to ChatGPT, GPT4All has the ability to comprehend Chinese, a feature that Bard lacks. If you want to interact with GPT4All programmatically, you can install the nomic client as follows. Install the nomic client using pip install nomic. Then use a short Python script, such as the sketch shown earlier, to interact with GPT4All. GPT4All Demystified GPT4All aims to provide a cost-effective and fine-tuned model for high-quality LLM results. The GPT4All model was fine-tuned using an instance of LLaMA 7B with LoRA on 437,605 post-processed examples for 4 epochs. Detailed model hyperparameters and training code can be found in the GitHub repository. GPT4All developers collected about 1 million prompt responses using the GPT-3.5-Turbo OpenAI API from various publicly available datasets. After an extensive data preparation process, they narrowed the dataset down to a final subset of 437,605 high-quality prompt-response pairs. Developing GPT4All took approximately four days and incurred $800 in GPU expenses and $500 in OpenAI API fees. The final gpt4all-lora model can be trained on a Lambda Labs DGX A100 8x 80GB in about 8 hours, with a total cost of $100. A preliminary evaluation of GPT4All compared its perplexity with the best publicly known alpaca-lora model. Results showed that the fine-tuned GPT4All models exhibited lower perplexity in the self-instruct evaluation compared to Alpaca. However, this assessment was not exhaustive, as the developers encourage users to run the model on local CPUs to gain qualitative insights into its capabilities. TL;DR Considering how expensive LLMs are to train and serve, Meta's LLaMA is a foundation for accelerating the LLM open-source community. Stanford's Alpaca, based on LLaMA, offers an optimized smaller model with enhanced performance. Now, GPT4All, also built on LLaMA, enables local execution. Generative AI is evolving rapidly every day. Thanks to Brandon Duderstadt for reviewing this article.",https://pub.towardsai.net/llama-gpt4all-simplified-local-chatgpt-ab7d28d34923#485a,towards_ai -Inside Code Llama: Meta AI's Entrance in the Code LLM Space,"Inside Code Llama The release of Code Llama does not include a single model but three different variants, characterized by their parameter sizes of 7B, 13B, and 34B. Each of these models has been trained on an extensive pool of 500B tokens encompassing code and code-related information. Notably, the 7B and 13B base and instruct models have been endowed with fill-in-the-middle (FIM) competence, empowering them to seamlessly insert code into existing code structures. This attribute equips them to handle tasks like code completion right from the outset. The trio of models caters to distinct requisites concerning serving and latency. For instance, the 7B model boasts the ability to operate on a single GPU. While the 34B model stands out for yielding optimal outcomes and elevating coding assistance, the smaller 7B and 13B versions excel in speed, making them fitting for low-latency tasks such as real-time code completion. Meta AI's innovations further extend to two nuanced adaptations of Code Llama: Code Llama - Python and Code Llama - Instruct. 
Code Llama - Python is a specialized derivation, meticulously honed on a substantial volume of Python code spanning 100B tokens. Given Python's central role in code generation benchmarks and its significance within the AI community, this focused model augments utility.Code Llama - Instruct represents an alignment and refinement of Code Llama through instructional fine-tuning. This novel training approach entails furnishing the model with ""natural language instruction"" inputs paired with anticipated outputs. This strategic methodology enhances the model's capacity to grasp human expectations in prompts. For endeavors involving code generation, it is advised to opt for Code Llama - Instruct versions, as they have been calibrated to yield useful and secure natural language responses. Deep diving into the Code Llama training and fine-tuning, there are a few aspects that are worth highlighting 1) DatasetLlama's training rests on a meticulously curated dataset enriched with publicly available code, offering a near-duplicate-free landscape. The dataset consists of 500B tokens during the initial phase, starting from the 7B, 13B, and 34B versions. A supplementary 8% of sample data is garnered from natural language datasets linked to code domains. 2) InfillingWithin the realm of Code Infilling, a pivotal task revolves around predicting missing segments within a program while being guided by contextual surroundings. Pragmatic applications encompass code completion within Integrated Development Environments (IDEs), type inference, and even the generation of in-code documentation such as docstrings. Operating in alignment with the concept of causal masking, a framework expounded by Aghajanyan et al. (2022) and Fried et al. (2023), Meta AI molds infilling models. The training process entails shifting parts of training sequences to the conclusion, paving the path for autoregressive predictions. In this endeavor, both the versatile 7B and 13B models undergo infilling-oriented training, echoing the strategies advised by Bavarian et al. (2022). 3) Long Context Fine-Tuning:Unraveling the intricacies of handling extensive sequences is a formidable pursuit in the realm of transformer-based language models. The pivotal challenges orbit around extrapolation - delving into sequence lengths beyond those encountered during training - and the quadratic complexity of attention passes that tilts the balance towards short-to-medium inputs for effective training. Meta AI steps forward with a unique solution, introducing the dedicated domain of long context fine-tuning (LCFT). Embracing sequences encompassing 16,384 tokens, a substantial leap from the 4,096 tokens featured in Llama 2's initial code training stages, LCFT empowers models with extended-range capabilities. This strategic shift occurs within a fine-tuning phase, circumventing undue escalation in training costs. 4) Instruction Fine-Tuning:Code Llama's prowess extends to instruction fine-tuning, witnessed in the refined Code Llama - Instruct models. This iteration leverages Code Llama as its foundation, sculpted to aptly respond to queries. Merging Supervised Fine-Tuning with an expansive pool of Rejection Sampling examples yields this instructive competence. 5) Self-InstructIn the realm of datasets, Meta AI embarks on a proprietary journey, curating instances tethered to code-related tasks. In recognition of the resource-intensive nature of acquiring data from human annotators or through human feedback, a particular emphasis on self-instruction is embraced. 
The domain of coding tasks, steeped in the insights of professional developers, forms the canvas on which this innovative approach is painted. The Results To evaluate Code Llama, Meta AI engaged two widely acknowledged coding benchmarks: HumanEval and Mostly Basic Python Programming (MBPP). The HumanEval benchmark systematically assesses the model's prowess in code completion via docstrings, while the MBPP benchmark scrutinizes the model's capacity to translate descriptions into executable code. The meticulous benchmarking endeavor yielded illuminating results: Code Llama outshone open-source, code-centric Large Language Models (LLMs) and even outperformed its predecessor, Llama 2. For instance, in the case of Code Llama 34B, remarkable scores emerged - an impressive 53.7% on the HumanEval benchmark and a formidable 56.2% on the MBPP benchmark. These scores stood as the highest amongst comparable state-of-the-art solutions, positioning Code Llama 34B on par with the notable capabilities of ChatGPT. Code Llama promises to be one of the most important code LLMs in the near future. It certainly helps reaffirm the value of open-source foundation models across different domains.",https://pub.towardsai.net/inside-code-llama-meta-ais-entrance-in-the-code-llm-space-9f286d13a48d#c9e0,towards_ai -Meta's Llama 2: Revolutionizing Open Source Language Models for Commercial Use,"I. Llama 2: Revolutionizing Commercial Use Unlike its predecessor Llama 1, which was limited to research use, Llama 2 represents a major advancement as an open-source commercial model. Businesses can now integrate Llama 2 into products to create AI-powered applications. Availability on Azure and AWS facilitates fine-tuning and adoption. However, restrictions apply to prevent exploitation. Companies with over 700 million monthly active users cannot use Llama 2. Additionally, its output cannot be used to improve other language models. II. Llama 2 Model Flavors Llama 2 is available in four different model sizes: 7 billion, 13 billion, 34 billion, and 70 billion parameters. While 7B, 13B, and 70B have already been released, the 34B model is still awaited. The pretrained variant, trained on a whopping 2 trillion tokens, boasts a context window of 4096 tokens, twice the size of its predecessor Llama 1. Meta also released a Llama 2 fine-tuned model for chat applications that was trained on over 1 million human annotations. Such extensive training comes at a cost, with the 70B model taking a staggering 1,720,320 GPU hours to train. The context window's length determines the amount of content the model can process at once, making Llama 2 a powerful language model in terms of scale and efficiency. III. Safety Considerations: A Top Priority for Meta Meta's commitment to safety and alignment shines through in Llama 2's design. The model demonstrates exceptionally low AI safety violation percentages, surpassing even ChatGPT in safety benchmarks. Finding the right balance between helpfulness and safety when optimizing a model poses significant challenges. While a highly helpful model may be capable of answering any question, including sensitive ones like ""How do I build a bomb?"", it also raises concerns about potential misuse. Thus, striking the perfect equilibrium between providing useful information and ensuring safety is paramount. However, prioritizing safety to an extreme extent can lead to a model that struggles to effectively address a diverse range of questions. 
This limitation could hinder the model's practical applicability and user experience. Thus, achieving an optimum balance that allows the model to be both helpful and safe is of utmost importance. To strike the right balance between helpfulness and safety, Meta employed two reward models - one for helpfulness and another for safety - to optimize the model's responses. The 34B parameter model reportedly has higher safety violations than the other variants, possibly contributing to the delay in its release. IV. Helpfulness Comparison: Llama 2 Outperforms Competitors Llama 2 emerges as a strong contender in the open-source language model arena, outperforming its competitors in most categories. The 70B parameter model outperforms all other open-source models, while the 7B and 34B models outshine Falcon in all categories and MPT in all categories except coding. Despite being smaller, Llama 2's performance rivals that of ChatGPT 3.5, a significantly larger closed-source model. While GPT-4 and PaLM-2-L, with their larger size, outperform Llama 2, this is expected due to their capacity for handling complex language tasks. Llama 2's impressive ability to compete with larger models highlights its efficiency and potential in the market. However, Llama 2 does face challenges in coding and math problems, where models like GPT-4 excel, given their significantly larger size. GPT-4 performed significantly better than Llama 2 on coding (HumanEval benchmark) and math problem tasks (GSM8k benchmark). Open-source AI technologies, like Llama 2, continue to advance, offering strong competition to closed-source models. V. Ghost Attention: Enhancing Conversational Continuity One unique feature in Llama 2 is Ghost Attention, which ensures continuity in conversations. This means that even after multiple interactions, the model remembers its initial instructions, ensuring more coherent and consistent responses throughout the conversation. This feature significantly enhances the user experience and makes Llama 2 a more reliable language model for interactive applications. In the example below, on the left, it forgets to use an emoji after a few conversations. On the right, with Ghost Attention, even after having many conversations, it will remember the context and continue to use emojis in its response. VI. Temporal Capability: A Leap in Information Organization Meta reported a groundbreaking temporal capability, where the model organizes information based on time relevance. Each question posed to the model is associated with a date, and it responds accordingly by considering the event date before which the question becomes irrelevant. For example, if you ask the question, ""How long ago did Barack Obama become president?"", it is only relevant after 2008. This temporal awareness allows Llama 2 to deliver more contextually accurate responses, enriching the user experience further. VII. Open Questions and Future Outlook Meta's open-sourcing of Llama 2 represents a seismic shift, now offering developers and researchers commercial access to a leading language model. With Llama 2 outperforming MosaicML's current MPT models, all eyes are on how Databricks will respond. Can MosaicML's next MPT iteration beat Llama 2? Is it worthwhile to compete with Llama 2 or join hands with the open-source community to make the open-source models better? Meanwhile, Microsoft's move to host Llama 2 on Azure despite its significant investment in OpenAI's ChatGPT raises interesting questions. 
Will users prefer the capabilities and transparency of an open-source model like Llama 2 over closed, proprietary options? The stakes are high, as Meta's bold democratization play stands to reshape preferences and partnerships in the AI space. One thing is certain - the era of open language model competition has begun. VIII. Conclusion With the launch of Llama 2, Meta has achieved a landmark breakthrough in open-source language models, unleashing new potential through its commercial accessibility. Llama 2's formidable capabilities in natural language processing, along with robust safety protocols and temporal reasoning, set new benchmarks for the field. While select limitations around math and coding exist presently, Llama 2's strengths far outweigh its weaknesses. As Meta continues honing Llama technology, this latest innovation promises to be truly transformative. By open-sourcing such an advanced model, Meta is propelling democratization and proliferation of AI across industries. From healthcare to education and beyond, Llama 2 stands to shape the landscape by putting groundbreaking language modeling into the hands of all developers and researchers. The possibilities unlocked by this open-source approach signal a shift towards a more collaborative, creative AI future.",https://pub.towardsai.net/metas-llama-2-revolutionizing-open-source-language-models-for-commercial-use-1492bec112b#148f,towards_ai -The Generative AI Revolution: Exploring the Current Landscape,"What is Generative AI? Generative AI is a subfield of machine learning that involves training artificial intelligence models on large volumes of real-world data to generate new contents (text, image, code,...) that is comparable to what humans would create. This is achieved by training algorithms on large datasets to identify patterns and learn from them. Once the neural network has learned these patterns, it can generate new data that adheres to the same patterns. However, this process is computationally intensive. Fundamentally, a generative AI for NLP applications will process an enormous corpus on which it has been trained and respond to prompts with something that falls within the realm of probability, as learnt from the mentioned corpus. For example, autocomplete is a low-level form of generative AI. Advanced models like ChatGPT and DALL-E take the concept to a whole new level. Different model architectures, such as diffusion models and Transformer-based large language models (LLMs), can be employed for generative tasks such as image and language generation. Diffusion models are a type of generative AI model that can be used for a variety of tasks, including image generation, image denoising, and inpainting. Similarly, the Transformer architecture revolutionized the language domain. The new era of language models are Transformer-based, which is a type of deep learning architecture for natural language processing (NLP) tasks. They utilize a self-attention mechanism to transform the input sequence into a set of context-aware high dimensional vectors (also known as embeddings) that can be used for a variety of NLP tasks, including language generation, machine translation, and text classification. The most well-known transformer-based LLMs are the GPT family, developed by OpenAI. The primary advantage of transformer-based LLMs over traditional NLP models is that they are highly parallelizable and can handle long-range dependencies between words in a sentence more effectively. 
This makes them more suitable for tasks that require a deeper understanding of the context, such as text summarization or generating coherent and fluent text. Let's explore the history and current state of generative AI and the key players shaping its future. The Generative AI Revolution Generative AI has been around for several years. One of the earliest examples is the Eliza chatbot developed by Joseph Weizenbaum in 1966. However, these early implementations relied on a rules-based approach that had several shortcomings, such as a limited vocabulary, lack of context, and overreliance on patterns. As a result, they were prone to frequent breakdowns, making it difficult to customize and expand these initial chatbots. Recently, significant progress has been made in AI and machine learning, resulting in the development of advanced generative AI systems. It's no coincidence that these breakthroughs have happened all at once. They're based on a new class of AI models that are incredibly flexible and powerful, surpassing anything we've seen before. In deep learning, three critical components contributed the most to this recent success: scaling models, large datasets, and more compute power - all working together to bring us to this exciting point in AI advancement. Progress in GPUs and their application to Machine Learning GPUs are designed for parallel processing, making them well-suited for the computationally intensive tasks involved in training deep neural networks. Unlike CPUs, which focus on sequential processing, GPUs have thousands of smaller cores that can handle multiple tasks simultaneously, allowing for faster training of large networks. A key breakthrough for machine learning was the intuition that GPUs could be used for neural networks, together with software progress such as Nvidia's release of CUDA in 2007, a programming platform that allowed GPUs to be used as general-purpose computers. AlexNet - 2012 - The Deep Learning Revolution The modern AI revolution began in 2012 with step-change progress in deep learning and convolutional neural networks (CNNs), which were particularly effective in solving computer vision problems. Although CNNs had been around since the 1990s, they were not practical due to their intensive computing power requirements. However, in 2009, Stanford AI researchers introduced ImageNet, a labeled image dataset used to train computer vision algorithms, and a yearly challenge. In 2012, AlexNet combined CNNs trained on GPUs with ImageNet data to create the most advanced visual classifier at the time. The model outperformed the runner-up by a significant margin of nearly 11%! The success of CNNs, the ImageNet dataset, and GPUs drove significant progress in computer vision. Transformers: Attention Is All You Need (Google) - 2017 One critical area where deep learning lagged was natural language processing (NLP), which involves getting computers to understand and hold a coherent conversation with humans rather than just perform translation or classification. NLP breakthroughs were needed to bridge this gap. Previously, researchers relied on models such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to process and analyze time-based data. These models were proficient at recognizing short sequences such as spoken words but struggled with longer sentences and paragraphs. Because of these architectural flaws, the models were unable to capture the complexity and richness of ideas that arise when sentences are combined into larger bodies of text. 
A significant breakthrough in AI was the development of the ""Transformer"" model by Google with the very popular paper ""Attention Is All You Need"". This model represented a major milestone as it revolutionized the approach to translation problems by utilizing a mechanism called ""attention"", which allows the model to analyze the entire input sequence and determine the relevance of each part of the input to each component of the output. In the years since, Transformers have been found to be state-of-the-art models for many other NLP tasks as well, and recently also in other domains such as computer vision. Next word prediction, scale and fine tuning - BERT (Google) and GPT (OpenAI) family - 2018 With the advancement of Transformers, a key further breakthrough was the potential to train on unstructured data via a next-word-prediction objective on website content. This led to models such as BERT and GPT-2. It delivered surprising capabilities and ""zero shot"" performance at completing new tasks the model hadn't been trained for. OpenAI also continued to probe how far the performance of these models could keep increasing with more scale and more training data. One of the major challenges faced by researchers was acquiring the right training data. ImageNet, a large collection of hand-labeled images, required a significant human effort. Despite the abundance of text available on the Internet, creating a meaningful dataset for teaching computers to work with human language beyond individual words is a time-consuming process. Additionally, labels created for one application using the same data may not apply to another task. With the advancements of BERT and the first iteration of GPT, we started to harness the immense amount of unstructured text data available on the internet and the computational power of GPUs. OpenAI further advanced this approach with their development of the GPT-2 and GPT-3 models, which are short for ""generative pre-trained transformer."" These models are specifically designed to generate new words in response to input and are pre-trained on vast amounts of text using the next-word-prediction objective. Another key breakthrough in these large transformer models is the concept of ""fine tuning"" - adapting a large model to new, more specific tasks or to a smaller, targeted data set - to improve performance in a particular domain with far lower compute cost than training a new model from scratch. For example, a foundational language model like GPT-3 may be fine-tuned on a dataset of medical documents to create a specialized model for medical document processing. This model will be better at understanding medical terminology, identifying medical entities, and extracting relevant information from medical texts. Instruction Tuning - InstructGPT and ChatGPT (OpenAI) - 2022 The most recent advancement, which has led to the Generative AI landscape today, is the concept of Instruction Tuning - taking a model which has just been trained to predict the next word of a text document and teaching it (via fine tuning) to actually follow human instructions and preferences. This made it far easier to interact with these LLMs and to get them to answer questions and perform tasks without getting sidetracked by just trying to predict the next word. 
A fortunate feature of instruction tuning is that not only does it help increase the accuracy and capabilities of these models, but it also helps align them with human values and prevents them from generating undesired or dangerous content. OpenAI's specific technique for instruction tuning is called reinforcement learning from human feedback (RLHF), where humans help train the model by ranking its responses. Building on top of Instruction Tuning, OpenAI released ChatGPT - which reorganized instruction tuning into a dialogue format and created an easy-to-use interface for interacting with the AIs. This has catalyzed the mass awareness and adoption of Generative AI products and has led to the landscape we have today. The Current LLM Landscape The breakthroughs in Generative AI have left us with an extremely active and dynamic landscape of players. This consists of 1) AI hardware manufacturers such as Nvidia and Google, 2) AI cloud platforms such as Azure, AWS, Nvidia, and Google, 3) open-source platforms for accessing the full models, such as Hugging Face, 4) access to LLMs via API, such as OpenAI, Cohere, and Anthropic, and 5) access to LLMs via consumer products such as ChatGPT and Bing. Additionally, there are many more breakthroughs happening each week in this universe, such as the release of multimodal models (that can understand both text and images), new model architectures (such as Mixture of Experts), and agent models (models that can set tasks and interact with each other and with other tools). This all leads to many questions, such as: How will most people interact with LLMs? Who will be the leading players going forward? How fast will the capabilities of these models keep progressing? Are open-source models dangerous because of the lack of control of their outputs and use, or are they beneficial due to democratizing access to this technology? 1. OpenAI's GPT Models Notable Models Task-specific models Find model information here: https://platform.openai.com/docs/models/gpt-3 Image & Audio Models OpenAI, the company behind the GPT models, is an AI research and deployment company. The San Francisco-based lab was founded in 2015 as a nonprofit with the goal of building ""artificial general intelligence"" (AGI), which is essentially software as smart as humans. OpenAI conducts innovative research in various fields of AI, such as deep learning, natural language processing, computer vision, and robotics, and develops AI technologies and products intended to solve real-world problems. OpenAI transitioned into a for-profit company in 2019. The company plans to cap the profit of the investors at a fixed multiple of their investment (noted by Sam Altman as currently ranging between 7x and 100x depending on the investment round date and risk). As per the WSJ, OpenAI was initially funded by $130m of charity funding (Elon Musk tweeted he contributed $100m) and has since raised at least $13bn led by Microsoft (where OpenAI makes use of Azure cloud credits). With the Microsoft partnership, OpenAI's ChatGPT, along with Microsoft's own search AI, created an improved version of Bing and transformed Microsoft's Office productivity apps. In 2019, OpenAI released GPT-2, a model that could generate realistic human-like text in entire paragraphs with internal consistency, unlike any of the previous models. The next generation, GPT-3, launched in 2020, was trained with 175 billion parameters. 
GPT-3 is a multi-purpose language tool that users can access without requiring them to learn a programming language or other computer tools. In November 2022, OpenAI released ChatGPT, a superior version of the company's earlier text generation models with the capability to generate humanlike prose. After the success of ChatGPT (GPT-3.5), OpenAI released GPT-4 in March 2023, which has multimodal capabilities. The model processes both image and text inputs for text generation. The model has a maximum context of 32,768 tokens, capable of handling around 25,000 words, compared to GPT-3.5's 4,096-token context size. GPT-4 produces 40% more factual responses, and its response rate for disallowed content is down by 82% compared to previous models (as reported by OpenAI). 2. Google's PaLM Models Google AI, formerly known as Google Research, is the AI research and development arm of Google. It was unveiled at Google I/O 2018. Google has contributed many of the most significant papers in breakthroughs in modern machine learning. Google's largest publicly disclosed model is its Pathways Language Model (PaLM), which has likely recently been rolled out in its Bard chatbot. PaLM has been used as a foundation model in several Google projects, including the instruction-tuned Flan-PaLM and the recent PaLM-E (the first ""embodied"" multimodal language model). The pre-training of PaLM involved self-supervised learning drawing from a large text corpus that included multilingual web pages (27%), English books (13%), open-source code repositories and source code from GitHub (5%), multilingual Wikipedia articles (4%), English news articles (1%), and other social media conversations (50%). PaLM excelled in 28 out of 29 NLP tasks in few-shot performance, beating prior larger models like GPT-3 and Chinchilla. PaLM variants scale up to 540 billion parameters (vs GPT-3 at 175 billion) and were trained on 780 billion tokens (vs GPT-3's 300 billion) - totalling around 8x more training compute than GPT-3 (but likely considerably less than GPT-4). PaLM was trained across multiple TPU v4 pods. Being a dense decoder-only Transformer model, PaLM is trained on two TPU v4 pods connected over a data center network and uses a combination of model and data parallelism. Researchers used 3072 TPU v4 chips in each pod, attached to 768 hosts. This large TPU configuration allows for efficient training at scale without using pipeline parallelism. The Pathways system allows for scaling a model across Google's thousands of Tensor Processing Unit chips. 3. DeepMind's Chinchilla Model DeepMind Technologies, founded in 2010, is a British AI research laboratory. It became a wholly owned subsidiary of Alphabet Inc. in 2015, after its acquisition by Google in 2014. DeepMind created a Neural Turing Machine, a neural network that tries to replicate the short-term memory of the human brain. In 2016, DeepMind's AlphaGo program defeated a human professional Go player, and their program AlphaZero later defeated the most powerful programs in the games of Go and Shogi. The programs acquired competence using reinforcement learning. In 2020, DeepMind's program AlphaFold started making advances in the problem of protein folding, and by July 2022, it had predicted over 200 million protein structures. In April 2022, Flamingo, a single visual language model program capable of describing any picture, was launched. Three months later, in July 2022, DeepNash was announced as a model-free multi-agent reinforcement learning system. 
DeepMind developed a language model called Chinchilla in March 2022, which was claimed to outperform GPT-3. A key breakthrough in the Chinchilla paper was that previous LLMs had been trained on too little data - for a given parameter size, the optimal model should use far more training data than GPT-3 did. While more training data takes more time to gather and leads to higher training costs, achieving more capable models at a smaller parameter size has huge benefits for inference costs (the costs needed to run and use the finished model, which scale with parameter size). Chinchilla has 70B parameters (60% smaller than GPT-3) and was trained on 1.4 trillion tokens (4.7x GPT-3). Chinchilla achieves an average accuracy of 67.5% on Measuring Massive Multitask Language Understanding (MMLU) and outperforms other large language models like Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. 4. Microsoft & Nvidia's Megatron-Turing Model Nvidia is a company that designs GPUs and APIs for data science and high-performance computing, and SoCs for mobile computing and the automotive market. The company is a leading supplier of AI hardware and software. Additionally, Nvidia's CUDA API enables the creation of massively parallel programs that leverage GPUs. Developed by NVIDIA's Applied Deep Learning Research team in 2021, the Megatron-Turing model consists of 530 billion parameters and was trained on 270 billion tokens. Nvidia has provided access to its MT-NLG model via an Early Access program for its managed API service. Nvidia has made many of its LLM and Generative AI models and services available through its new DGX Cloud platform. 5. Meta's LLaMA Models Meta AI, formerly known as Facebook Artificial Intelligence Research (FAIR), is an artificial intelligence laboratory that aims to share open-source frameworks, tools, libraries, and models for research exploration and large-scale production deployment. In 2018, they released the open-source PyText, a modeling framework focused on NLP systems. Then, in August 2022, they announced the release of BlenderBot 3, a chatbot designed to improve conversational skills and safety. In November 2022, Meta developed a large language model called Galactica, which assists scientists with tasks such as summarizing academic papers and annotating molecules and proteins. Released in February 2023, LLaMA (Large Language Model Meta AI) is a transformer-based foundational large language model by Meta that ventures into both the AI and academic spaces. The model aims to help researchers, scientists, and engineers advance their work in exploring AI applications. It will be released under a non-commercial license to prevent misuse, and access will be granted to academic researchers, individuals, and organizations affiliated with the government, civil society, academia, and industry research facilities on a selective case-by-case basis. The sharing of code and weights allows other researchers to test new approaches in LLMs. The LLaMA models have a range of 7 billion to 65 billion parameters. LLaMA-65B can be compared to DeepMind's Chinchilla and Google's PaLM. Publicly available unlabeled data was used to train these models, and training smaller foundational models requires less computing power and resources. 
LLaMA 65B and 33B have been trained on 1.4 trillion tokens in 20 different languages, and according to the Facebook Artificial Intelligence Research (FAIR) team, the model's performance varies across languages. The data sources used for training included CCNet (67%), GitHub, Wikipedia, ArXiv, Stack Exchange, and books. LLaMA, like other large-scale language models, has issues related to biased and toxic generation and hallucination. 6. Eleuther's GPT-Neo Models Founded in July 2020 by Connor Leahy, Sid Black, and Leo Gao, EleutherAI is a non-profit AI research lab. The organization has emerged as a leading player in large-scale natural language processing research, with a focus on interpretability and alignment of large models. Their mission is to ensure that the ability to study foundation models is not limited to a few companies, promoting open science norms in NLP and creating awareness about the capabilities, limitations, and risks around these models. In December 2020, EleutherAI curated the Pile, an 800 GiB dataset of diverse text for training LLMs. Subsequently, in March 2021, they released the GPT-Neo models. EleutherAI also released GPT-J-6B in June 2021, a 6 billion parameter language model that was the largest open-source GPT-3-like model at the time. Additionally, they combined CLIP with VQGAN to develop a free-to-use image generation model, which guided the foundation of Stability AI. EleutherAI also trains language models in other languages, such as Polyglot-Ko, which were trained in collaboration with the Korean NLP company TUNiB. EleutherAI used Google's TPU Research Cloud Program, but by 2021, they took funding from CoreWeave. The company also uses TensorFlow Research Cloud for cheaper computing resources. In February 2022, EleutherAI released the GPT-NeoX-20B model, which became the largest open-source language model of any type at the time. In January 2023, the company was formally incorporated as a non-profit research institute. EleutherAI's NLP model, GPT-NeoX-20B, has 20 billion parameters and was trained using the company's GPT-NeoX framework and GPUs from CoreWeave. The GPT-NeoX-20B model has a 72% accuracy on LAMBADA sentence completion. When measured for zero-shot accuracy on STEM subjects using the Hendrycks Test Evaluation, it had an average of 28.98%. The model uses the Pile dataset for training and consists of data from 22 sources that fall under the following 5 categories: academic writing (PubMed Abstracts and PubMed Central, arXiv, FreeLaw, USPTO Backgrounds, PhilPapers, NIH Exporter), web scrapes and Internet resources (CommonCrawl, OpenWebText2, StackExchange, Wikipedia-English), prose (BookCorpus2, Bibliotik, Project Gutenberg), dialogue (YouTube subtitles, Ubuntu IRC, OpenSubtitles, Hacker News, EuroParl), and miscellaneous (GitHub, the DeepMind Mathematics dataset, Enron Emails). GPT-NeoX-20B is publicly accessible and is a pre-trained, general-purpose, autoregressive transformer decoder language model. It is a powerful few-shot reasoner with 44 layers, a hidden dimension size of 6144, and 64 heads. Additionally, it uses Rotary Positional Embeddings instead of the learned positional embeddings found in GPT models. 7. Cohere's XLarge Founded in 2019 by Aidan Gomez, Ivan Zhang, and Nick Frosst, Toronto-based Cohere specializes in natural language processing (NLP) models. 
Cohere has improved human-machine interactions and aided developers in performing tasks such as summarization, classification, finding similarities in content, and building their own language models. Cohere's API helps users design tools for language comprehension and offers a backend toolkit for integration in multiple ways. Cohere provides two types of large language models: Generation Language Models and Representation Language Models. The company uses a foundation model to train AI systems on large-scale data, enabling them to learn from new data to perform various tasks. Generative AI aims to produce human-like creations through code, and Cohere competes with similar model providers like OpenAI and Anthropic, with its point of differentiation being a focus on serving enterprise users in incorporating generative AI. Cohere's goal is to make NLP accessible to all while building machines that are safe to use. In September 2021, Cohere raised $40 million, and a few months later, in November 2021, Google Cloud announced its partnership with Cohere. The company intends to use Google Cloud's TPUs for the development and deployment of its products, and Amazon SageMaker also gives access to Cohere's language AI. Cohere powers HyperWrite, which helps users quickly generate articles. AWS has also announced a partnership with Cohere AI. To date, Cohere has raised $170 million, and with the ongoing rush of funding in AI platforms, the Canadian startup is expected to be valued at $6 billion. Cohere is set to introduce a new dialogue model to aid enterprise users in generating text while engaging with the model to fine-tune the output. Cohere's XLarge model resembles ChatGPT but provides developers and businesses with access to this technology. Cohere's base model has 52 billion parameters, compared to OpenAI's GPT-3 Davinci model, which has 175B parameters. Cohere stresses accuracy, speed, safety, cost, and ease of use for its users and has paid much attention to the product and its design, developing a cohesive model. 8. Anthropic AI's Claude Anthropic is an American AI startup and public benefit corporation founded in 2021 by Daniela Amodei and Dario Amodei, former members of OpenAI. The company specializes in developing AI systems and language models, with a particular focus on the transformer architecture. Anthropic's research on the interpretability of machine learning systems covers fields ranging from natural language and interpretability to human feedback, scaling laws, reinforcement learning, and code generation, among others. The company stresses the application of responsible AI and presents itself as an AI safety and research company working towards building reliable, steerable, and interpretable AI systems. By 2022, Google had invested nearly $400 million in Anthropic, resulting in a formal partnership between the two companies and giving Google a 10% stake in Anthropic. Outside backing amounted to $580 million, with total investments in Anthropic exceeding $1 billion to date. Anthropic has developed a conversational large language model AI chatbot named Claude, which uses a messaging interface and a technique called constitutional AI to better align AI systems with human intentions. AnthropicLM v4-s3 is a 52-billion-parameter, autoregressive model, trained unsupervised on a large text corpus. The ten principles used by Anthropic are based on the concepts of beneficence, non-maleficence, and autonomy. 
Claude is capable of a variety of conversational and text-processing tasks, such as summarization, search, creative and collaborative writing, Q&A, and coding. It is easy to converse with, more steerable, and takes direction on personality, tone, and behavior. Anthropic offers two versions of Claude - Claude (Claude-v1) and Claude Instant. Claude-v1 is a powerful, state-of-the-art, high-performance model capable of handling complex dialogue, creative content generation, and detailed instructions. Claude Instant is lighter, less expensive, and much faster, making it suitable for handling casual dialogue, text analysis, and summarization. However, Claude is an expensive platform compared to ChatGPT. Anthropic presents Claude as an honest, helpful, and harmless AI system, much less likely to produce harmful outputs than existing chatbots, which have been known to be toxic and biased, use offensive language, and hallucinate. According to Anthropic, Claude cannot access the internet and is designed to be self-contained, trained to avoid sexist, racist, and otherwise toxic outputs, and built to prevent human engagement in illegal and unethical activities. However, compared to ChatGPT, Claude is poor at math and programming. Still, the platform has also been seen to hallucinate and provide dubious instructions. Another major concern is that it is possible to bypass Claude's built-in safety features through clever prompting. The embargo on media coverage of Claude was lifted in January 2023, and a waiting list for early access to Claude was opened in February. Claude is now available and accessible to users through the Poe app by Quora. Also, the Juni Tutor Bot on Discord, an online tutoring solution, is powered by Anthropic. Additionally, Claude has found integration with Notion, DuckDuckGo, RobinAI, Assembly AI, and others. 9. AI21's Jurassic Models AI21 Labs specializes in Natural Language Processing to develop generative AI models that can understand and generate text. The Tel Aviv-based startup was founded in 2017 by Yoav Shoham, Ori Goshen, and Amnon Shashua. AI21 has emerged as a rival to OpenAI. In 2019, the startup raised $9.5 million, and in October 2020, it launched Wordtune, an AI-based writing app. AI21 Labs launched AI21 Studio and Jurassic-1 in August 2021. This was followed by Walden Catalyst investing $20 million in AI21 Labs in November, soon after which the company completed a $25 million series A round led by Pitango First. AI21 raised $64 million in the next round of funding. AI21 Labs launched Wordtune Spices in January 2023 and Jurassic-2 in March 2023. The Jurassic-1 model by AI21 Labs generates human-like text and performs complex tasks like question answering, text classification, and others. The Jurassic-1 model comes in two sizes, with Jurassic-1 Jumbo containing 178 billion parameters. The model uses a unique 250,000-token vocabulary that includes multi-word tokens, reducing the number of tokens the model needs and thus improving computational efficiency and reducing latency. Jurassic-1 allows developers to train custom versions of the model with just 50-100 training examples, helping users build customized applications and services. Jurassic-1 has been notably used by Latitude to scale production of its gaming world, by Harambee to create a custom chatbot to increase sign-ups for its youth employment programs, and by Verb to build a writing tool for authors. 
The next iteration of Jurassic (Jurassic-2) is a highly customizable language model. It has comprehensive instruction tuning on proprietary data, which gives it advanced instruction-following capabilities. The model supports languages like Spanish, French, German, Portuguese, Italian, and Dutch. Compared to the Jurassic-1 model, it has a response time up to 30% faster, significantly reducing latency. Jurassic-2 comes in three sizes, each with a separate instruction-tuned version - Large, Grande, and Jumbo. Jurassic-2 helps users build virtual assistants and chatbots and helps with text simplification, content moderation, creative writing, etc. Jurassic-2 also has zero-shot instruction capabilities. The model boasts more current knowledge and an up-to-date database, with training based on data updated to the middle of 2022, compared to ChatGPT, whose training data was cut off at the end of 2021. Jurassic-2 comes with five APIs built for businesses that want specifically tailored generative AI features. The APIs include tools for paraphrasing, summarizing, checking grammar, segmenting long texts by topic, and recommending improvements. On Stanford's Holistic Evaluation of Language Models (HELM), Jurassic-2 Jumbo ranks second with an 86.8% win rate. Jurassic-2 is available for free until May 1st, 2023. 10. Baidu's ERNIE Model Baidu, based in Beijing, is a prominent Chinese company that specializes in artificial intelligence. In 2019, Baidu launched a powerful AI language model named ERNIE (Enhanced Representation through Knowledge Integration), which has been open-sourced along with its code and pre-trained model based on PaddlePaddle. Since its inception, ERNIE has undergone significant improvements and can now execute a diverse array of tasks, such as language comprehension, language generation, and text-to-image generation. ERNIE was designed to enhance language representations by implementing knowledge masking strategies, such as entity-level masking and phrase-level masking. Baidu launched ERNIE 2.0 in July 2019, which introduced a continual pre-training framework. This framework incrementally builds and learns tasks through constant multi-task learning. ERNIE 3.0 was unveiled in early 2021 and introduced a unified pretraining framework that allows collaborative pretraining among multi-task paradigms. Unlike other models such as GPT-3, ERNIE 3.0 showcased task-agnostic zero-shot and few-shot learning capabilities and could be easily tailored for natural language understanding and generation tasks with zero-shot learning, few-shot learning, or fine-tuning. In late 2021, Baidu released ERNIE 3.0 Titan, a pre-trained language model with 260 billion parameters trained on massive unstructured data. Baidu developed ERNIE Bot, its latest large language model (LLM) and generative AI product. It is designed to serve as a foundational AI platform that can facilitate intelligent transformations in various industries, including finance, energy, media, and public affairs. Access to ERNIE Bot is currently limited to invited users, with the API expected to be available to enterprise clients through Baidu AI Cloud after application (as of March 16th). Baidu aims to use the capabilities of ERNIE Bot to revolutionize its search engine, which holds the dominant position in China. Moreover, it is anticipated that ERNIE Bot will improve the operational efficiency of various mainstream industries, including cloud computing, smart cars, and home appliances. 
Hardware and Cloud Platforms Nvidia's H100 Tensor Core GPU, its ninth-generation data center GPU, contains 80 billion transistors and is optimized for large-scale AI and high-performance computing (HPC) workloads. The A100, Nvidia's predecessor to the H100, is one of the best GPUs for deep learning. There are also Google's Tensor Processing Units (TPUs), custom-designed application-specific integrated circuits (ASICs) used for efficient machine learning workloads and tightly integrated with TensorFlow, Google's machine learning framework. Google Cloud Platform has made TPU v4 available on the cloud, specifically designed to accelerate NLP workloads, and has also developed TPU v5 for internal use. Microsoft Azure also offers GPU instances powered by Nvidia GPUs, such as the A100 and P40, that can be used for various machine learning and deep learning workloads. Another key development is the partnership between Microsoft Azure and OpenAI, which gave OpenAI the resources to train both GPT-3 and GPT-4 and made these models available to developers in their applications through Azure's cloud infrastructure. AWS provides access to GPUs such as the Amazon Elastic Compute Cloud (EC2) P3 instances, which offer up to 8 Nvidia V100 GPUs with 5,120 CUDA cores and 300 GB of GPU memory. AWS has also developed its own chips for inference (Inferentia) and training (Trainium). Several advanced models have been developed on these computing and cloud systems, including BERT, RoBERTa, Bloom, Megatron, and the GPT family. BERT is one of the first pre-trained models that incorporated the transformer architecture and resulted in state-of-the-art scores on many NLP tasks. RoBERTa is a variant of BERT, trained on a much larger dataset with a more efficient training procedure. Lastly, Bloom is an open-access multilingual language model containing 176 billion parameters that was trained on 384 A100 80GB GPUs. The increasing availability of specialized hardware for NLP tasks represents a significant development in cloud computing. With the availability of these tools, companies can now train and run models that were previously impossible to build. A note on Open Source Open-source LLM efforts have been progressing, both in terms of open datasets and open-source models available for anyone to fine-tune and use. The overall potential of open-source models is very promising. They provide more in-depth access to LLMs for everyone, not just through an API. However, there are definitely questions about the increased risks of models that haven't been aligned and are easier to adapt for nefarious use cases such as misinformation. AI efforts like Eleuther's ""The Pile"" and LAION's LAION-5B dataset have facilitated rapid progress in text and image modeling. Many companies and groups are also making foundational models accessible with open-source datasets, such as BigScience's Bloom model and the strategic partnership between Hugging Face and Amazon Web Services (AWS), which increases the availability of open-source datasets and models hosted on Hugging Face. Stability AI also supports EleutherAI's work studying large language models, while LAION's project involves crowdsourcing annotations for its OpenAssistant ChatGPT replication project. Additionally, CarperAI has developed open-source RLHF workflows, ranging from human annotation with CHEESE to RLHF training using the trlX package. 
Generative AI applied to other modalities By some measures, consumer-facing Generative AI has become the fastest-growing technology trend of all time, with various models emerging for image, text, and code generation. For example, Midjourney's Discord has attracted around 13 million members for image generation, while ChatGPT has reportedly gained over 100 million users within a few months of release. Software development use cases have also seen a significant rise, with over 1.2 million developers using GitHub Copilot's technical preview as of September. 1. Image Generation: DALL-E Midjourney Stable Diffusion DreamStudio The combination of models, data, and computing has provided an incredible set of tools for working with images. OpenAI's DALL-E is an AI system that uses deep learning and transformer language models to generate digital images from natural language descriptions. It employs a decoder-only transformer model that models text and images as a single data stream containing up to 256 tokens for text and 1024 for images. The neural network then autoregressively models them. DALL-E is a 12-billion-parameter version of GPT-3. The model uses a causal mask for text tokens and sparse attention for image tokens. DALL-E 2 is capable of producing higher-resolution images and uses zero-shot visual reasoning. It can create anthropomorphized versions of subjects, fill in the blanks, and transform existing images. However, DALL-E uses public datasets as training data, which can affect its results and often leads to algorithmic biases. Midjourney is an artificial intelligence program developed by Midjourney, Inc., an independent research lab. The platform uses natural language descriptions to generate images, and users can create images by using Discord bot commands on the official Discord server. On March 16, 2023, beta version 5 was released. Users can generate images by typing the /imagine command followed by a prompt, and the bot generates four images, from which the user selects the image they want to upscale. Midjourney, Inc. is also developing a web interface. Stable Diffusion is an open-source image model funded by Stability AI that generates images from text and performs tasks like inpainting, outpainting, and image-to-image translation. It uses a latent diffusion model supported by EleutherAI and LAION. It requires a minimum of 8 GB of VRAM, so it can run locally without relying on cloud services. Stable Diffusion 2.0 was released in November 2022 and trained on pairs of images and captions from LAION-5B and its subsets. DreamStudio is the official online implementation and team interface API for Stable Diffusion, developed by Stability AI. DreamStudio and Stable Diffusion have slightly different interfaces even though they are applications of the same technology. The web app was launched in August 2022, replacing the free Discord bot. The web app offers better functionality and stability, using the Stable Diffusion algorithm to generate images based on the user's prompt. DreamStudio API access carries a fee. One of the key features of DreamStudio is its support for negative prompting. It also allows users to overpaint, copy, modify, and distribute images for commercial purposes. 2. Audio Generation: Whisper AudioGen AudioLM Whisper, developed by OpenAI, is a versatile automatic speech recognition system that supports multilingual speech recognition, speech translation, and language identification. 
It has been trained on 680,000 hours of multilingual and multitask supervised data using Python 3.9.9 and PyTorch 1.10.1, and the codebase is expected to be compatible with Python 3.8-3.10 and recent PyTorch versions. It deploys an encoder-decoder transformer model that uses 30-second chunks of input audio converted to log-Mel spectrograms, which are then passed to an encoder. The decoder predicts the corresponding text caption and intermixes special tokens to perform various tasks. Whisper provides an open-source model and inference code for speech processing research and new application development. With nearly one-third of its dataset being non-English, Whisper outperforms the supervised state of the art on zero-shot CoVoST2 translation into English. Google's AudioLM is a pure audio model that uses language modeling to generate high-quality audio without annotated data. It generates speech continuations that preserve the identity, prosody, and accent of the speaker and the recording conditions, and it can also generate coherent piano music continuations. The model demonstrates long-term consistency in syntax, harmony, rhythm, and melody, and has the potential for extension to multilingual speech, polyphonic music, and audio events. AudioLM uses a hybrid tokenization scheme and a SoundStream neural codec to improve fidelity. The model achieved a 51.2% success rate from human raters, and an audio classifier with 98.6% accuracy was trained to detect synthetic speech generated by AudioLM. Currently, AudioLM is only available for research purposes and is not publicly available. Meta's AudioGen AI converts text prompts into audio files. It is the audio parallel of image-generating AI like DALL-E. It uses a language AI model and approximately 4,000 hours of training data to generate ambient sounds, sound events, and their composition. Additionally, it can extend existing audio to create rudimentary music. The quality of the audio output has been rated at 70% via Amazon's Mechanical Turk platform. However, AudioGen currently cannot sequence sounds through time, and the ownership rights of the generated audio are unclear. 3. Search Engines: Neeva You.com Neeva is an AI-powered search engine that provides ad-free and private searches. It achieves this through its in-house LLMs and search stack, while also blocking third-party website trackers and not sharing user information. Neeva's unique feature is its AI summaries, which provide synthesized answers backed by cited authority. It also allows users to search personal email accounts, calendars, and cloud storage platforms. This feature combines the best aspects of LLMs, like ChatGPT, with authority and timeliness. However, it only functions with question queries and has limitations on the free version (the premium plan is priced at $4.95/mo). Neeva has over 2 million users and local language versions in Germany, France, and Spain. You.com is a California-based search engine that uses multimodal conversational AI to group web results into website categories sorted by user preferences. It was launched in public beta in November 2021 with a focus on privacy and personalization. It offers YouWrite, a text generator, and YouChat, a chatbot with community-built apps and blended LLMs. You.com does not collect users' personal information and offers personal and private search modes. Users can create content directly from the search results, which builds trust and reliability. 4. 
Code Generation: Copilot Codex GitHub Copilot is a tool that assists developers in programming by using AI to convert natural language into coding suggestions. It is powered by OpenAI Codex, which allows it to understand the developer's coding style and suggest context-specific solutions. When developers input their desired logic into the system, GitHub Copilot can generate code suggestions automatically. However, it is important to note that these suggestions are just that - suggestions - and it is up to the developer to decide whether to use them or not. OpenAI Codex is a natural language processing model that is based on GPT-3 and can generate working code in multiple programming languages such as Python, JavaScript, and Ruby, among others. To train Codex, billions of lines of source code from public sources, as well as natural language data, including code from GitHub repositories, were used. It has a memory of 14KB for Python code and is a powerful, transformer-driven system that can effectively and efficiently carry out developers' tasks. 5. Text Generation: Jasper Jasper.AI is a subscription-based text generation model that requires minimal input from the user and searches the web to generate the desired output. It is particularly useful for generating short copy text where character limitations are important. The platform offers over 50 templates, including product descriptions, email subject lines, and Facebook headlines, among others. Additionally, it can help with generating ideas for blog posts and creating better outlines. However, Jasper.AI does have some drawbacks, such as the absence of fact-checking and citation of sources, which can lead to hallucinations. Additionally, learning the command input to achieve the desired output may take some time. Conclusion Generative AI is a revolutionary technology that has the ability to transform many aspects of our lives. Keep in mind that there are still challenges in developing these models, such as the need for massive datasets and compute power, high training costs, and accessibility. Studies have revealed that many large language models are not adequately trained. Additionally, smaller datasets are still crucial for enhancing LLM performance in domain-specific tasks. Compute cost optimization is also essential since generative models, especially large language models, are still expensive to both train and serve for inference. Big players in the industry are working on optimizing compute costs at every level. Safety and security remain pressing concerns in the development of generative AI, and key players are incorporating human feedback to make the models safer from the outset. Open-source alternatives are also necessary to increase access to the next-generation LLM models for practitioners and independent scientists to push the boundaries forward.",https://pub.towardsai.net/the-generative-ai-revolution-exploring-the-current-landscape-4b89998fcc5f#7d83,towards_ai -"Building Intuition on the Concepts behind LLMs like ChatGPT - Part 1- Neural Networks, Transformers, Pretraining, and Fine Tuning","Neural Networks LLMs like ChatGPT are trained on huge amounts of publicly accessible text data from the internet using artificial neural networks. Artificial neural networks are machine learning algorithms that are designed to mimic, in an abstract way, our brain's structure and learning process. 
They are made up of layers of interconnected nodes or ""neurons,"" and through repeated training iterations on massive amounts of text data, the network learns the patterns in the texts and the nuances of the language - enough to generate coherent words, sentences, or whole documents by itself. The artificial neural network is the main feature of a subset of machine learning called deep learning. It is very important in the field of AI due to its ability to capture intricate patterns and dependencies in data and generalize from these patterns to make predictions on new, unseen data. In the context of language modeling, this means predicting what word should come next given a sequence of preceding words. Compared to conventional machine learning algorithms like linear regression, neural networks are able to represent and model non-linear relationships between different features present in large amounts of data through the use of nonlinear mathematical functions (the activation function) in the neurons of the network's hidden layers. Neural networks have produced consumer tech that you've probably interacted with (and they are not necessarily exclusive to language tasks) such as unlocking your phone using facial recognition, the augmented reality feature in your Pokemon game, or the show suggestions on Netflix home screens. Andrej Karpathy even argues that it can be a new and better way of writing software: for example, instead of hand-coding the logic of a program (if condition A is met, do x; if condition A is not met, do y), the neural network learns through examples from the training data that if it encounters 'condition A' in production it should do x. These conditions/logic are not defined by its creators; rather, the neural network adjusts itself (by tweaking its billions or even trillions of parameters - the weights and biases) to conform to this desired behavior. Nobody knows what each individual weight and bias does specifically or how a single weight contributes to a specific change in the behavior of the artificial neural network. These parameters are changed en masse as a unit during training via gradient updates (discussed in more detail later). This is why you'll often hear machine learning models trained on neural networks described as 'black boxes'. Their inputs and outputs can be observed, but the internal workings - how it does what it does - are not easily understood. This is also the reason for discoveries of 'emergent' capabilities. As an LLM gets bigger and bigger (measured by its number of parameters), it starts coming out of training with unanticipated abilities. For example, GPT-2 was discovered to be good at language translation, GPT-3 was an excellent few-shot learner, and GPT-4 has shown sparks of artificial general intelligence or AGI. None of these were explicitly defined as a training goal - the main objective was to predict the next word in a sequence. Emergent behaviors are not unique to large neural networks. As a system gets bigger and more complex, the interaction between its individual components can lead to unexpected behaviors that cannot be fully explained by analyzing the properties of each individual component in isolation - a single ant is stupid, but a colony of ants can build very complex tunnel networks and wage war against other colonies. This phenomenon has been documented in systems like social insects (ants and bees), crowd behavior, and other biological ecosystems. 
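To make the idea of layers, neurons, and nonlinear activation functions concrete, here is a minimal sketch (an illustrative toy network, not any production architecture) of a small feed-forward neural network in PyTorch:

import torch
import torch.nn as nn

# A tiny feed-forward network: each Linear layer holds weights and biases (the adjustable
# 'knobs'), and the ReLU activation is the nonlinearity that lets the network model
# non-linear relationships in the data.
model = nn.Sequential(
    nn.Linear(128, 64),  # input features -> hidden layer
    nn.ReLU(),           # nonlinear activation function
    nn.Linear(64, 10),   # hidden layer -> output scores
)

x = torch.randn(1, 128)                             # one example with 128 input features
logits = model(x)                                   # forward pass: raw scores for 10 outputs
print(logits.shape)                                 # torch.Size([1, 10])
print(sum(p.numel() for p in model.parameters()))   # total number of trainable parameters

Real LLMs follow the same principle, just with billions of parameters and a more elaborate architecture (the transformer discussed later in this article).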
Pretraining Foundation Models The first step in creating something like ChatGPT is pretraining a base model or a foundation model. The goal of this step is to create a machine learning model that is able to autonomously generate coherent, human-like text (a phrase, sentence, or paragraph) by generating words in sequence based on its prediction of what word should come next given the preceding words. It's called pretraining because the output of this step - the base model - is still a raw product that has limited practical applications and is usually only of interest to researchers. Base models are 'trained' further via the fine-tuning stages for specific tasks with real-world utility like text translation, summarization, classification, etc. At the start of pretraining, the parameters of the neural network are set to random numerical values. The words in the massive internet text data are converted into numerical representations in the form of tokens (as integers) and embeddings (as vectors) before being fed to the neural network. Tokens and embeddings will be discussed in detail in the next part of this series, but for now, think of a token as the unique ID of a word in the model's vocabulary and the embedding as the meaning of that word. The model is given a word or words and is asked to predict the next word based on the preceding word or words. Then, it is tested on unseen data and evaluated on the accuracy of its predictions against the 'ground truth' next word from a held-out dataset previously unseen by the model. Consider the example sentence in the training dataset: ""I have to go to the store"". This sentence might be used as follows: The model is given ""I have"" and is expected to predict ""to"". Then it's given ""I have to"" and is expected to predict ""go"". Then ""I have to go"" and is expected to predict ""to"". Finally, ""I have to go to the"" and is expected to predict ""store"". Going through the whole corpus of the training dataset like this, the model will be able to learn which words tend to appear after different sets of words. It learns the dependencies between ""I"" and ""have"", ""have"" and ""to"", and so on. In the testing step, the process is similar, but the sentences or texts used are ones that the model has not been trained on. It's a way to check how well the model generalizes its language understanding to unseen data. Let's consider the unseen sentence from the test set: ""She needs to head to the ___"" Even though this exact sentence was not part of the training dataset, the model can use its understanding of similar contexts it encountered to make an educated prediction. For example, it has seen in training sentences like ""I have to go to the store"" that phrases such as ""to go to the"" or ""to head to the"" are often followed by a location or a destination. Based on this, the model might predict ""market"", ""store"", ""office"", or other similar words, as they are common destinations in this kind of context. So, while the model was trained on ""I have to go to the store"" and variations of this text with similar meaning, it's able to generalize from that to understand that ""She needs to head to the..."" is likely to be followed by a similar type of word, even though this exact sentence was not part of its training data. How Models 'Learn' At the start of pretraining, the model would usually output nonsensical sequences of words when asked to make a prediction, as it hasn't 'learned' anything yet. 
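The sliding construction of (context, next word) training pairs described above can be sketched in a few lines of Python (splitting on whitespace purely for illustration; real models operate on tokens rather than words):

sentence = 'I have to go to the store'
words = sentence.split()

# Build (context, target) pairs: the model sees the context and must predict the target word.
pairs = [(words[:i], words[i]) for i in range(2, len(words))]
for context, target in pairs:
    print(' '.join(context), '->', target)
# I have -> to
# I have to -> go
# I have to go -> to
# I have to go to -> the
# I have to go to the -> store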
In our example sentence earlier, it might generate the word 'apple' instead of the ground truth next word - 'store'. Since LLMs are probabilistic, 'wrong' in this context means the model is assigning a higher probability (to be selected) to the word 'apple' compared to the expected word - 'store'. The ultimate goal is to have the model output 'store' every time it's asked to predict the word that comes next after the sequence ""She needs to head to the ___"". The difference between the actual and the expected or ground truth next word is calculated using a 'loss function', where the greater the difference, the higher the 'loss' value. The loss is a single number that 'averages' the loss or error of all the predictions you asked the model to make. Through several iterations of these steps, the aim is to minimize the value of this 'loss' through processes called backpropagation and gradient descent optimization. The model 'learns' or improves its prediction ability through these steps. You're probably wondering how you can 'calculate the difference between two words' to arrive at a loss value. Do note that what goes through the neural network are not actual texts (words, sentences) but numerical representations of these texts - their tokens and embeddings. The numerical representations of a word sequence are processed through the layers of the network, where the output is a probability distribution over the vocabulary to determine what word comes next. An untrained model might assign a higher probability to the token id of the word 'apple' (say 0.8) compared to the token id of the ground truth next word - 'store' (at 0.1). The neural network will not encounter a single word or letter of any text. It works exclusively with numbers - basically a calculator with extra steps. Through backpropagation, the degree of the error of the model (the loss value) is propagated backward through the neural network. It computes the derivative of the loss with respect to each individual weight and bias, i.e., how sensitive the loss is to changes in each specific parameter. For those of us who didn't take differential calculus in school (such as myself), think of the model parameters (weights/biases) as adjustable knobs. These knobs are arbitrary - in the sense that you can't tell in what specific way each one governs the prediction ability of the model. The knobs, which can be rotated clockwise or counterclockwise, have different effects on the behavior of the output. Knob A might increase the loss 3x when turned clockwise, while knob B might reduce the loss by 1/8 when turned counterclockwise (and so on). All these knobs are checked (all billions of them) to get information on how sensitive the loss is to adjustments of each knob - this numerical value is each parameter's derivative with respect to the loss. Calculating these derivatives is called backpropagation. The output of backpropagation is a vector (a list of numbers) whose elements or dimensions consist of the parameters' individual derivatives. This vector is the gradient of the error with respect to the existing parameter values (or the current learnings) of the neural network. A vector has two properties: length or magnitude, and direction. The gradient vector contains information on the direction in which the error or loss is increasing. The magnitude of the vector signifies the steepness or rate of increase. 
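As a minimal sketch of the loss and backpropagation steps just described (using PyTorch's built-in cross-entropy loss and autograd; the numbers and the token id are made up for illustration):

import torch
import torch.nn.functional as F

vocab_size = 10
torch.manual_seed(0)

# Pretend these are the raw scores (logits) the network produced for the next word,
# and the token id of the ground-truth next word ('store' in the running example).
logits = torch.randn(1, vocab_size, requires_grad=True)
target = torch.tensor([7])  # hypothetical token id for 'store'

loss = F.cross_entropy(logits, target)  # large when the model puts little probability on 'store'
loss.backward()                         # backpropagation: derivatives of the loss w.r.t. the scores

print(loss.item())   # the single loss number the training loop tries to minimize
print(logits.grad)   # the gradient: how the loss changes as each score changes

In a real model, loss.backward() would populate gradients for every weight and bias in the network, not just these stand-in logits.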
Think of the gradient vector as the map of a foggy hill you're descending - gradient descent optimization uses the information about direction and steepness from the gradient vector to reach the bottom of the hill (the minimum loss value) as efficiently as possible by navigating toward the path with the greatest downward incline (the opposite direction of the gradient vector). This involves iteratively adjusting the values of the weights and biases of the network (by subtracting from them small values scaled by the learning rate) en masse to reach this optimal state. After these steps, the hope is that during the next training iteration, when the model is again asked to predict the next word for ""She needs to head to the..."", it should assign a higher probability to the word 'store'. This process is repeated several times until there is no significant change to the loss value, meaning the model's learning has stabilized or has reached convergence. So the TL;DR on how neural networks learn to communicate in English (and other languages) is - math, in serious amounts. Like oodles. It boils down to reducing the value of a single number (the loss value) generated from complex computations within the neural network - where the smaller this number gets, the more 'fluent' or 'coherent' the language model becomes. The millions or billions of mathematical operations applied between matrices and vectors in the inner layers of the network somehow coalesce into a geometric model of the language. To help with intuition, we've anthropomorphized the model by using words like 'understand', 'seen', and 'learn', but in truth, it has no capacity to do any of these things. It's just an algorithm that outputs the next best token of a sequence based on a probability distribution and a sampling method. The Transformer The Transformer is the breakthrough in natural language processing (NLP) research that gave us ChatGPT. It is a type of neural network architecture that utilizes a unique self-attention mechanism. It was famously introduced in the paper 'Attention is All You Need', which came out in 2017. Almost all state-of-the-art LLMs that came out after this paper (like BERT and GPT-1) were built on or using ideas from the transformer. It's hard to overstate the importance of this paper due to its impact on deep learning. It's now finding its way into vision tasks, making it truly multi-modal and demonstrating its flexibility in handling other types of data. It also started the '...is all you need' memetic trend that even the Towards AI editorial team is unable to resist. Prior to transformers, the neural networks used in NLP that produced SOTA models relied on architectures that process data sequentially, e.g., recurrent neural networks or RNNs - this means that during training, each word or token is processed by the network one after the other, in sequence. Note that the order of words is important to preserve the context/meaning of a sequence - 'the cat ate the mouse' and 'the mouse ate the cat' are two sentences with two different meanings even though they are made up of the exact same words/tokens (albeit in a different order). One of the key innovations of the transformer is doing away with recurrence or sequential token processing. Instead of processing tokens sequentially, it encodes the position information of each word (i.e., in which order a word appears in the sequence being processed) into its embedding before it is fed into the network's inner layers. 
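One common way to encode positions is the sinusoidal scheme from the original 'Attention is All You Need' paper (GPT-style models typically learn their positional embeddings instead, and newer models use variants like rotary embeddings); a minimal sketch:

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    # Returns a (seq_len, d_model) matrix that gets added to the token embeddings,
    # giving each position in the sequence a unique, smoothly varying signature.
    positions = np.arange(seq_len)[:, None]   # position index of each token: 0, 1, 2, ...
    dims = np.arange(d_model)[None, :]        # index of each embedding dimension
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding

print(sinusoidal_positional_encoding(seq_len=5, d_model=8).shape)  # (5, 8)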
More importantly, transformers solved the issue of long-term dependencies that neural nets like RNNs struggled with. Given a long enough sequence of words (e.g., a very long paragraph), RNNs will 'forget' the context of the words they processed earlier in the sequence - this is related to the vanishing gradient problem. RNNs store information on the relevance of words in a sequence up to that point in what's called the hidden state at each sequential or time step. As an RNN processes a long sequence, gradients corresponding to earlier time steps can become very small during backpropagation. This makes it challenging for the RNN to learn from the early parts of the sequence and can lead to the 'loss' of information about words processed earlier. This is problematic for a next-word prediction model, especially if those 'forgotten' words are important to the context of the sequence currently being generated. The transformer solves this limitation through the 'self-attention' mechanism. As with positional encoding, each word, through its embedding, is encoded with information on the degree to which it should 'attend to' the rest of the words in the sequence - no matter the length of the sequence or the relative distance of the attended word within it. This encoding is done simultaneously for all words in the sequence, allowing the transformer to preserve the context of any sequence. The degree to which one word should attend to other words is a 'learned' trait stored in the model weights and encoded in the word embeddings via matrix multiplications. These 'learnings' get adjusted during each training iteration as the model learns more about the relationships between words in the training data. The final output of the self-attention layer (the Z matrix) is a matrix of word embeddings encoded with information on the position of each word in the sequence (from the positional encoding step) and how much each word should attend to all other words in the sequence. It is then fed to a traditional neural network like the one discussed earlier in the article (called the feed-forward neural network). These steps (attention + feed-forward, which together make up a transformer block) are repeated multiple times across the hidden layers of the transformer - 96 times for GPT-3, for example. The transformation in each layer adds additional information to the 'knowledge' of the model on how to best predict the next word in the sequence. According to the LLM scaling laws published by OpenAI, to train better models, increasing the number of parameters is 3x more important than increasing the size of the training data. (Note: DeepMind has since published a paper with a differing view.) This translates to a significant increase in computational requirements, as handling a larger number of parameters demands more complex calculations. Parallelization, which is the process of dividing a single task into multiple sub-tasks that can be processed simultaneously across multiple compute resources, becomes essential in dealing with this problem. Parallelization is difficult to achieve with RNNs given their sequential nature. This is not an issue for transformers, as they compute relationships between all elements in a sequence simultaneously, rather than sequentially. It also means that they work well with GPUs or video cards. Graphics rendering requires a large number of simple calculations happening concurrently, and the numerous, small, and efficient processing cores that a GPU has, which are designed for simultaneous operations, make it a good fit for tasks such as matrix and vector operations that are central to deep learning. 
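The self-attention computation described above boils down to a handful of matrix operations, which is exactly why it parallelizes so well on GPUs. A minimal single-head sketch (ignoring multi-head projections, masking, and many other details of a real transformer):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model) word embeddings; Wq/Wk/Wv: learned projection matrices.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # how much each word should attend to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                                # the Z matrix: embeddings mixed according to attention

d_model = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(5, d_model))                     # a toy sequence of 5 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (5, 8)

In a full transformer block, this output then passes through the feed-forward network, and the whole block is stacked dozens of times.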
AI going 'mainstream' and the mad scramble to build larger and better models are a boon to GPU manufacturers. NVIDIA specifically - whose stock price has grown 200% YTD as of this writing - is the highest-performing stock this year, with its market cap pushed to USD 1 trillion. It joins megacaps like Apple, Google, Microsoft, and Amazon in this exclusive club. The Transformer is a decidedly complex topic, and the explanation above left out important concepts wholesale in order to be more digestible to a broader audience. If you want to know more, I found these gentle yet significantly more fleshed-out introductions to the topic: Jay Alammar's illustrated transformer, Lili Jiang's potion analogy, or, if you want something more advanced, Karpathy's nanoGPT that babbles in Shakespeare-ish. Fine-tuning 'chat' models like ChatGPT The output of pretraining is a base model or foundation model. Examples of recently released text-generation foundation models are GPT-4, Bard, LLaMA 1 & 2, and Claude 1 & 2. Since base models already have extensive knowledge of the language from pretraining (the structure of sentences, relationships between words, etc.), you can leverage this knowledge to further train the model to do specific tasks - translation, summarization, or conversational assistants like ChatGPT. The underlying idea is that the model's general language understanding gained from pretraining can be used for a wide range of downstream tasks. This idea is called transfer learning. If you ask or prompt a base model a question, it will probably reply with another question. Remember that it's trained to complete a word sequence by predicting the word that should come next given the previous words in the sequence. However, we can get a base model to answer questions by 'tricking' it into thinking that it's trying to complete a sequence. Using this idea, the model goes through another round of training using different sets of prompt/completion pairs in a question-and-answer format. Instead of 'learning English' from random texts found on the internet by predicting what words come next after a set of words, the model 'learns' that to complete a prompt in a 'question' form, the completion should be in an 'answer' form. This is the supervised fine-tuning (SFT) stage. Pretraining utilizes a type of machine learning called self-supervised learning, where the model trains itself by creating the 'label' or ground truth word it's trying to predict from the training data itself. Here is our example from earlier: The model is given ""I have"" and is expected to predict ""to"". Then it's given ""I have to"" and is expected to predict ""go"". The labels or target words 'to' and 'go' are created by the model as it goes through the corpus. Note that the target/ground truth words are important as they are the basis of the loss value - i.e., how good the current model's prediction is versus the target - and the subsequent gradient updates. Compared to the pretraining phase, the training data preparation in the fine-tuning stages can be labor intensive. It requires human labelers and reviewers who carefully annotate the 'labels' or the target completions. 
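A single supervised fine-tuning example typically looks something like the following (the field names, wording, and formatting here are purely illustrative and vary between datasets); commonly, the loss is computed only on the completion, so the model learns to produce answer-shaped text when it sees question-shaped prompts:

sft_example = {
    'prompt': 'What is the capital of France?',        # the human-written instruction or question
    'completion': 'The capital of France is Paris.',   # the human-curated target answer
}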
However, since the model has already learned general features of the language, it can quickly adapt to the language task it's being fine-tuned for, even with the limited availability of task-specific training data. This is one of the benefits of transfer learning and the motivation behind pretraining. According to Karpathy, 99 percent of the compute power and training time, and most of the data, used to train an LLM are spent during the pretraining phase, and only a fraction is used during the fine-tuning stages. Fine-tuning uses the same gradient update method outlined earlier, but this time the model is learning from a list of human-curated question/answer pairs that teach it how to structure its completions, i.e., 'what to say and how to say it'. It then goes through further fine-tuning stages like reward modeling and reinforcement learning from human feedback (RLHF) to train the model to output completions that cater more to human preference. In this stage, human labelers score the model's completions on attributes like truthfulness, helpfulness, harmlessness, and toxicity. The human-preferred completions get reinforced into the training, i.e., they will have a higher probability of appearing in completions from the fine-tuned version of the model. The output of these fine-tuning steps is the 'assistant' or 'chat' models like ChatGPT. These are the 'retail' versions of these foundation models and are what you interact with when you go to the ChatGPT website. The GPT-3 base model (davinci) can be accessed via an API. The GPT-4 base model has not been released as an API as of this writing and is unlikely to be released by OpenAI, given their recent statements about competition and LLM safety. These fine-tuning steps are generally the same for all available commercial and open-source fine-tuned models. End of Part 1 Note: Part 2 will talk about Embeddings, which predate the LLM explosion but are equally fascinating - how embedding models are trained, and how sentence- or document-level embeddings (used in RAG systems) are generated from word embeddings. We will also discuss Tokens and why they're needed. We have implied in this post that token = word to simplify things, but a real-world token can be an individual character or letter, a subword, a whole word, a series of words, or all of these types in a single model vocabulary! If I got anything wrong, I'm happy to be corrected in the comments! :) Resources/References: 3blue1brown. What is a neural network? Geeks for Geeks. Artificial Neural Networks and its Applications. Jay Alammar. The Illustrated Transformer. Luis Serrano. What Are Transformer Models and How Do They Work? Andrej Karpathy. The State of GPT. Andrej Karpathy. Let's build GPT: from scratch, in code, spelled out.",https://pub.towardsai.net/building-intuition-on-the-concepts-behind-llms-like-chatgpt-part-1-4cb6654ab67#123f,towards_ai -WizardCoder: Why It's the Best Coding Model Out There,"What Sets WizardCoder Apart One might wonder what makes WizardCoder's performance on HumanEval so exceptional, especially considering its relatively compact size. To put it into perspective, let's compare WizardCoder-Python-34B with CodeLlama-Python-34B. The single most important factor behind such a large difference in HumanEval benchmark performance is the dataset the model was trained on. The Power of Data: WizardCoder's Unique Dataset One of the key factors contributing to WizardCoder's remarkable performance is its training dataset. 
Most models rely on a dataset structure that typically includes a solid base with a lot of simple instructions, a reduced amount of complex instructions, and a minimal amount of really complex instructions. To train a model for peak performance on evaluation benchmarks, the training dataset should have a balance between simple instructions, complex instructions, and really complex instructions. This is where WizardCoder's dataset shines. It boasts a good amount of really complex instructions, a good amount of complex instructions, and a solid base with a lot of simple instructions. But there's a challenge: creating a dataset with complex instructions is inherently difficult, while simple instructions are readily available. Evol-Instruct Evol-Instruct is an evolutionary algorithm for generating diverse and complex instruction datasets using LLMs (GPT-4). It is designed to enhance the performance of LLMs by providing them with high-quality instructions that are difficult to create manually. In simple terms, Evol-Instruct is a complexity cascade of synthetically generated (GPT-4) instruction data. Instruction Evolution LLMs can make given instructions more complex and difficult using specific prompts. Additionally, they can generate entirely new instructions that are equally complex but completely different. Using this, we can iteratively evolve an initial instruction dataset, improving the difficulty level and expanding its richness and diversity. A. Instruction Evolver The Instruction Evolver is an LLM that uses prompts to evolve (develop) instructions, with two types: in-depth evolving and in-breadth evolving. A base dataset is given (e.g., Alpaca, generated using self-instruct, or the 70k ShareGPT dataset shared by real users), and from this base dataset we can create a more complex and diverse dataset. a) In-depth Evolving In-depth Evolving enhances instructions by making them more complex and difficult through five types of prompts, which aim to: (i) add constraints, (ii) deepen the question, (iii) concretize (make more specific), (iv) increase the required reasoning steps, and (v) complicate the inputs. Each type has its own prompt template built around a shared core (for example, an 'add constraints' prompt); these prompts help generate a complex instruction dataset, with similar templates for the other types of In-depth Evolving. b) In-breadth Evolving In-breadth Evolving addresses the limitation of open-domain instruction fine-tuning datasets (e.g., Alpaca, ShareGPT, etc.), which are often small in scale and lacking in topic and skill diversity. In-breadth Evolving solves this problem by designing a prompt to generate a completely new instruction based on the given instruction, requiring the new instruction to be more long-tailed. In-breadth Evolving aims to (1) enhance topic coverage, (2) enhance skill coverage, and (3) improve overall dataset diversity. B. Response Generation The same LLM is used to generate the corresponding responses for the evolved instructions using a dedicated prompt. C. Elimination Evolving (Instruction Eliminator) The evolved instruction may challenge the LLM to generate a response. Sometimes, when the generated response contains ""sorry"" and is relatively short in length (i.e., less than 80 words), it often indicates that the LLM struggles to respond to the evolved instruction, so we can use this rule to make a judgment. Another signal is when the response generated by the LLM contains only punctuation and stop words. 
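The elimination rules above amount to a simple filter; here is a rough sketch (the stop-word list and the exact checks are assumptions based on the description in this article, not the authors' code):

import string

STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'to', 'of', 'is', 'are', 'in'}  # illustrative subset

def should_eliminate(response: str) -> bool:
    # Heuristic from Elimination Evolving: drop evolved instructions the LLM struggles to answer.
    words = response.split()
    # Rule 1: an apologetic and relatively short response suggests the LLM could not follow the instruction.
    if 'sorry' in response.lower() and len(words) < 80:
        return True
    # Rule 2: a response made up only of punctuation and stop words carries no real content.
    stripped = [w.strip(string.punctuation).lower() for w in words]
    if all(w == '' or w in STOP_WORDS for w in stripped):
        return True
    return False

print(should_eliminate('Sorry, I cannot complete this task.'))  # True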
D. Finetuning the LLM on the Evolved Instructions Once all evolutions are done, the initial instruction dataset (the 52K instruction dataset of Alpaca) is merged with the evolved instruction data from all epochs, and the samples are randomly shuffled to create the final fine-tuning dataset. This processing ensures an even distribution of instructions of varying difficulty levels in the dataset, making model fine-tuning smoother. The WizardLM authors validate Evol-Instruct by fine-tuning the open-source LLaMA 7B model on evolved instructions and evaluating its performance, naming the resulting model WizardLM. Evol-Instruct works by generating a pool of initial instructions (the 52K instruction dataset of Alpaca), which are then evolved through a series of steps to create more complex and diverse instructions. Once the instruction pool is generated, it is used to fine-tune an LLM, resulting in a new model called WizardCoder. The fine-tuning process involves training the LLM on the instruction data to improve its ability to generate coherent and fluent text in response to various inputs. Prompt Format For WizardCoder, the prompt should follow the instruction-style format the model was fine-tuned with; a usage sketch showing this format appears after the Evaluation section below. Best Use Cases WizardCoder can be used for a variety of code-related tasks, including code generation, code completion, and code summarization. Here are some examples of input prompts that can be used with the model: Code generation: Given a description of a programming task, generate the corresponding code. Example input: ""Write a Python function that takes a list of integers as input and returns the sum of all even numbers in the list."" Code completion: Given an incomplete code snippet, complete the code. Example input: ""def multiply(a, b): \n return a * b _"" Code summarization: Given a long code snippet, generate a summary of the code. Example input: ""Write a Python program that reads a CSV file and calculates the average of a specific column."" The 34B model is not just a coding assistant; it's a powerhouse capable of: Automating DevOps Scripts: Generate shell scripts or Python scripts for automating tasks. Data Analysis: Generate Python code for data preprocessing, analysis, and visualization. Machine Learning Pipelines: Generate end-to-end ML pipelines, from data collection to model deployment. Web Scraping: Generate code for web scraping tasks. API Development: Generate boilerplate code for RESTful APIs. Blockchain: Generate smart contracts for Ethereum or other blockchain platforms. Evaluation WizardCoder beats all other open-source Code LLMs, attaining state-of-the-art (SOTA) performance, according to experimental findings from four code-generation benchmarks: HumanEval, HumanEval+, MBPP, and DS-1000. WizardCoder-Python-34B has demonstrated exceptional performance on code-related tasks. The model has outperformed other open-source and closed LLMs on prominent code generation benchmarks, including HumanEval (73.2%), HumanEval+, and MBPP (61.2%). WizardCoder-Python-34B-V1.0 attains the second position on the HumanEval benchmark, surpassing GPT-4 (2023/03/15 version, 73.2 vs. 67.0), ChatGPT-3.5 (73.2 vs. 72.5), and Claude 2 (73.2 vs. 71.2). The WizardCoder-15B-V1.0 model achieves 57.3 pass@1 on the HumanEval benchmark, which is 22.3 points higher than the SOTA open-source Code LLMs, including StarCoder, CodeGen, CodeGeeX, and CodeT5+. Additionally, WizardCoder significantly outperforms all open-source Code LLMs with instruction fine-tuning, including InstructCodeT5+, StarCoder-GPTeacher, and Instruct-Codegen-16B. 
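As a usage sketch (assuming the Hugging Face transformers library and the WizardLM/WizardCoder-Python-34B-V1.0 checkpoint, which needs a large GPU or quantization to run; the Alpaca-style template shown below is the format commonly published with WizardCoder releases - check the model card for the exact wording):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'WizardLM/WizardCoder-Python-34B-V1.0'  # assumed checkpoint name on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map='auto')

instruction = 'Write a Python function that returns the sum of all even numbers in a list.'
prompt = (
    'Below is an instruction that describes a task. '
    'Write a response that appropriately completes the request.\n\n'
    f'### Instruction:\n{instruction}\n\n### Response:'
)

inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))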
In conclusion, WizardCoder's success is attributed to its unique dataset and the innovative use of Evol-Instruct to enhance instruction complexity, leading to its outstanding performance across various code-related tasks and benchmarks. References: YouTube - WizardCoder 34B: Complex Fine-Tuning Explained; GitHub; Paper - WizardLM: Empowering Large Language Models to Follow Complex Instructions; Paper - WizardCoder: Empowering Code Large Language Models with Evol-Instruct",https://pub.towardsai.net/wizardcoder-why-its-the-best-coding-model-out-there-46a089c2833#0f8e,towards_ai