-
Running Phi 3 with vLLM and Ray Serve
While everyone is talking about new models and their possible use cases, their deployment aspect often gets overlooked. The journey from a trained model to a production-ready service is a complex and nuanced process that deserves more attention. From the perspective of a web API server, when a developer needs to access information like user…
-
Primer on Distributed Parallel Processing with Ray using KubeRay
In the early days of computing, applications handled tasks sequentially. As the scale grew with millions of users, this approach became impractical. Asynchronous processing allowed handling multiple tasks concurrently, but managing threads/processes on a single machine led to resource constraints and complexity. This is where distributed parallel processing comes in. By spreading the workload across…
-
Prometheus vs CloudWatch for Cloud Native Applications (Updated in 2024)
This post (originally published in 2019) has been updated to include recent product updates and cost calculations. Many companies are moving to Kubernetes as the platform of choice for running software workloads. When an organization that was previously running VMs in AWS decides to move to Kubernetes (either EKS or self-managed in AWS), one of the questions…
-
Virtual Clusters for Kubernetes
If you speak to teams or organizations running Kubernetes in production, one of the complaints you’ll often hear is how difficult multi-tenancy is. Organizations follow two approaches to share Kubernetes clusters across multiple tenants (multiple teams or people): namespace-based multi-tenancy and cluster-based multi-tenancy. In namespace-based multi-tenancy, each team or tenant…
-
Running Llama 3 with Triton and TensorRT-LLM
In the training phase, a machine learning (ML) model recognizes patterns in the training data and stores these patterns as numerical values called weights (model parameters). These parameters are used to predict an answer when the model is given new input data or a question. For example, when you ask a question to ChatGPT or…
-
Improving RAG Accuracy with Rerankers
In our previous post, we talked about creating an AI agent for technical communities that can use the conversation history amongst colleagues and other members to answer the user’s common questions. InSightful is the agent we built that uses the Reasoning and Action (ReAct) approach to respond to user queries accurately. However, during the retrieval…
-
Developing an AI Agent for Smart Contextual Q&A
Accelerated by the pandemic, online tech communities have grown rapidly. With new members joining every day, it’s tough to keep track of past conversations. Often, newcomers ask questions that have already been answered, causing repetition and redundancy. To tackle this, we built an intelligent assistant that tracks past conversations, searches Stack Overflow for technical help,…
-
Guide to GPU Sharing Techniques: vGPU, MIG and Time Slicing
Optimizing GPU utilization is essential in modern computing, particularly for AI and ML processing, where GPUs play a pivotal role due to their unparalleled ability to handle parallel computations and process large datasets rapidly. Modern GPUs are invaluable in these fields: their thousands of cores deliver very high parallelism. This enables complex model…
-
Key Elements of an Internal Developer Platform (IDP)
Developer platforms reduce developers’ cognitive load by abstracting away the various complexities of infrastructure and the development process, ultimately improving their productivity and workflow. However, building a developer platform can be complex, and organizations often fail at it on the first attempt. The primary reason is the lack of proper communication between the team building the…
-
Introduction to NVIDIA Network Operator
Artificial intelligence (AI) and machine learning (ML) are transforming industries, from automotive to finance. Since GPUs were adopted for AI/ML workloads in the last few years, processing large amounts of data has become significantly faster. Workload orchestrators like Kubernetes have also played an important role in maximizing GPU compute power. However, one of…