Argues that middleboxes are a permanent part of the Internet and proposes a Delegation-Oriented Architecture (DOA) that enables end-hosts to explicitly authorize middlebox processing.
Proposes APLOMB, which outsources enterprise middlebox processing (e.g., firewalls, WAN optimizers) to the cloud, reducing cost and management complexity.
Builds tiny, specialized VMs (as small as 5MB) on top of Xen and Click, achieving boot times under 30ms and near-line-rate throughput for middlebox processing.
Enables middleboxes to perform deep packet inspection on encrypted traffic without decryption, using garbled circuits and tokenization for oblivious keyword search.
Uses Rust's type and memory safety to isolate NFs within a single process instead of separate VMs/containers, enabling zero-copy packet passing between chained NFs.
Decouples NF logic from data-plane execution by decomposing NFs into reusable processing blocks, allowing multiple NF applications to share common blocks and reducing redundant computation.
Uses static analysis to automatically identify and categorize internal state variables in middlebox code, generating scaffolding for state export/import to support migration, scaling, and fault tolerance.
Decouples NF state from processing by storing all state in a remote low-latency data store, making NF instances stateless and simplifying scaling, migration, and fault tolerance.
Automatically constructs a parallel execution DAG from a sequential NF chain by analyzing read/write dependencies, enabling intra-chain parallelism with up to 2.5x throughput improvement.
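The dependency analysis described in this entry can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the NF names, the field names, and the `reads`/`writes` sets are hypothetical, and the sketch is conservative in that any read/write overlap forces sequential ordering (it does not model packet copying or drop actions).

```python
def conflicts(earlier, later):
    """True if `later` must wait for `earlier` (RAW, WAW, or WAR on any field)."""
    return bool(
        earlier["writes"] & (later["reads"] | later["writes"])
        or earlier["reads"] & later["writes"]
    )


def parallel_stages(chain):
    """Group a sequential NF chain into stages; NFs in one stage can run in parallel."""
    stage_of = []
    for i, nf in enumerate(chain):
        stage = 0
        for j in range(i):
            if conflicts(chain[j], nf):
                # Must run strictly after every earlier NF it conflicts with.
                stage = max(stage, stage_of[j] + 1)
        stage_of.append(stage)
    stages = [[] for _ in range(max(stage_of, default=-1) + 1)]
    for nf, s in zip(chain, stage_of):
        stages[s].append(nf["name"])
    return stages


# Hypothetical chain: field names are illustrative only.
CHAIN = [
    {"name": "firewall", "reads": {"src_ip", "dst_ip"}, "writes": set()},
    {"name": "monitor", "reads": {"src_ip"}, "writes": set()},
    {"name": "nat", "reads": {"src_ip"}, "writes": {"src_ip"}},
    {"name": "lb", "reads": {"dst_ip"}, "writes": {"dst_ip"}},
]
```

Here the two read-only NFs share a stage, and the two rewriting NFs share the next one because they touch disjoint fields.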
Introduces a rate-aware backpressure mechanism that signals upstream NFs to slow down when downstream NFs are congested, combined with dynamic CPU scheduling to avoid wasted work in chains.
Eliminates inter-core communication overhead by leveraging hardware classification (NIC RSS, Flow Director) to pin entire service chains to a single CPU core.
Proposes S6, a framework with a distributed shared state abstraction that allows NF instances to transparently access and migrate per-flow state, enabling elastic scaling with minimal disruption.
Uses hardware performance counters to profile NF sensitivity to shared resource contention, then makes interference-aware placement decisions to satisfy per-NF latency and throughput SLOs.
Provides customizable, per-NF TCP stacks where each NF subscribes only to the TCP events it needs, avoiding the cost of a full TCP implementation while enabling flow-level visibility.
Extends Click with a full, modular TCP/IP stack and POSIX-compatible socket API, enabling transport-layer NFs to be built as compositions of reusable Click elements.
Implements stateful NFs directly in programmable switch hardware using extended finite state machines (EFSMs), overcoming the limitation of stateless match-action tables.
Uses symbolic execution to derive formal, human-readable performance contracts that map each packet's execution path through an NF to its precise latency in cycles.
Provides per-flow transactional semantics across chains of stateful NFs by co-designing the state management layer with the packet processing pipeline for minimal overhead.
Enables push-button formal verification of NFs without expertise in formal methods, combining symbolic execution with carefully designed data structure abstractions.
Automatically compiles software NFs (e.g., Click) to programmable switches (P4), partitioning logic between the switch data plane and CPU based on hardware constraints.
Extends switch match-action tables with off-chip DRAM to support NFs with large state (e.g., large flow tables), achieving near-line-rate performance despite limited on-chip memory.
Predicts NF performance under co-location by profiling NFs in isolation and composing profiles with lightweight micro-benchmarks that characterize resource sensitivity.
Applies the serverless paradigm to NFs, deploying them as auto-scaled, event-triggered functions, addressing challenges of persistent state across ephemeral invocations.
Extends performance contracts into composable, modular performance interfaces that can be composed to predict the performance of NF chains without re-analyzing the full system.
Designs an NFV platform for public cloud environments where hardware-level optimizations (DPDK/SR-IOV) may be unavailable, bridging the gap between NFV research and real cloud deployments.
Proposes Memory-Compute Units (MCUs) that decouple state access from packet processing in switch ASICs, enabling richer stateful NFs at full line rate.
An exokernel-inspired OS abstraction for augmenting programmable switches with compute and memory from co-located rack servers, exposing resource heterogeneity to NF developers.
Disaggregates NF state into a shared remote store while running stateless processing instances, enabling independent scaling of compute and state with careful caching and batching to minimize overhead.
Automatically parallelizes single-threaded NF code across multiple cores by analyzing state access dependencies and partitioning/replicating state for safe concurrent execution.
A declarative query language for expressing complex, stateful network traffic analysis policies (e.g., multi-step attack detection) over streaming packet data, compiled into efficient automata for real-time execution.
Enables analysts to specify "what" to detect rather than "how," supporting composition of temporal and cross-flow correlations that are error-prone to implement imperatively.
Introduces consistent network update abstractions (per-packet and per-flow consistency) that guarantee network-wide policy invariants are maintained during SDN rule transitions, preventing transient violations.
Uses a two-phase update mechanism: new rules are installed across all switches before traffic is shifted, ensuring every packet sees either the old or new policy, never a mix.
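A toy model of the two-phase idea is sketched below, assuming packets are stamped with a policy version at ingress and each switch keeps rules per version. The `Switch` and `Network` classes, the version numbers, and the action strings are all hypothetical; garbage collection of old rules is omitted.

```python
class Switch:
    """A switch holding one action per policy version (hypothetical model)."""

    def __init__(self):
        self.rules = {}

    def install(self, version, action):
        self.rules[version] = action

    def forward(self, tag):
        # A packet is only ever matched against rules of its own version.
        return self.rules[tag]


class Network:
    def __init__(self, policy, version=1):
        self.switches = {name: Switch() for name in policy}
        for name, action in policy.items():
            self.switches[name].install(version, action)
        self.ingress_version = version

    def two_phase_update(self, new_policy, new_version):
        # Phase 1: install the new rules everywhere; old rules stay in place.
        # Assumes new_policy covers the same switches as the old policy.
        for name, action in new_policy.items():
            self.switches[name].install(new_version, action)
        # Phase 2: only now start stamping incoming packets with the new version.
        self.ingress_version = new_version

    def stamp(self):
        """Version tag an ingress switch would attach to a packet right now."""
        return self.ingress_version

    def trace(self, path, tag):
        """Actions applied to a packet carrying `tag` along `path`."""
        return [self.switches[name].forward(tag) for name in path]
```

A packet stamped before the update still sees only old rules at every hop, and a packet stamped afterwards sees only new ones, which is the per-packet consistency property.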
A regular-expression-based query language for monitoring network paths taken by packets, compiled into switch-level rules that encode path history via packet tagging.
The compiler uses determinization and tag minimization to generate efficient forwarding rules for runtime path-level monitoring with low overhead.
Provides a "one-big-switch" programming abstraction with mutable per-flow state, letting programmers write stateful packet-processing programs as if the entire network were a single switch.
The compiler handles state placement and program partitioning across physical switches, bridging the global abstraction and the distributed reality of limited-resource devices.
A reusable, event-driven monitoring stack exposing flow-level abstractions (TCP state events, reassembled bytestreams) so middlebox developers need not re-implement TCP reconstruction.
Provides a monitoring socket API with per-flow event callbacks, enabling IDS, proxy, and load balancer applications to be written concisely on a common substrate.
Introduces Quantitative Regular Expressions for networking (NetQRE), a declarative language for quantitative monitoring queries (e.g., traffic entropy, SYN-flood ratios) that go beyond boolean pattern matching.
The compiler generates streaming algorithms from NetQRE programs with formal worst-case performance guarantees, bridging expressive queries and line-rate processing.
Proposes Marple, a SQL-like query language for performance monitoring (e.g., per-flow latency, TCP incast detection) that compiles to programmable switch hardware with key-value store augmented pipelines.
Key insight: co-designing the language and hardware -- the language restricts queries to those efficiently implementable in hardware, while the hardware is designed to support the language's linear-in-state operations.
A declarative query interface for telemetry that automatically partitions execution between programmable switches (for early data reduction) and streaming processors (for complex analysis), reducing data volume sent to the backend.
A hardware-independent language and compiler for data plane programming that abstracts away differences between heterogeneous switch ASICs, enabling a single program to be compiled to multiple backend targets.
A DSL for writing event-driven control programs that execute entirely within the switch data plane, enabling reactive control logic (e.g., congestion response, failure detection) at data-plane speed without controller round-trips.
Provides shared state abstractions (registers, counters, tables) across a network of programmable switches with configurable consistency models, letting developers write programs as if operating on a single switch.
Enables programmable switches to intercept and process RPC messages (e.g., aggregation, caching) by handling the mismatch between variable-length RPC serialization formats and fixed-pipeline switch hardware.
A Click-inspired modular framework for in-network computing that provides a unified abstraction across heterogeneous programmable devices (SmartNICs, switches, FPGAs) with automatic partitioning and placement.
Introduces XDP, a framework for running eBPF programs at the earliest point in the Linux network stack (before socket buffer allocation), enabling high-rate packet processing inside the kernel, without kernel bypass, while still letting selected packets continue up the regular stack.
Compiles XDP/eBPF programs to run on FPGA-based SmartNICs, offloading packet processing from the host CPU while maintaining the familiar eBPF programming model.
Applies formal verification (using SMT-based automated verification and the Lean theorem prover) to Linux's BPF JIT compilers, finding and fixing bugs in production code while demonstrating that formal methods can be practical for kernel development.
Proposes extending eBPF to storage I/O paths, allowing applications to inject custom logic (e.g., filtering, aggregation) into the kernel's storage stack to reduce data movement and kernel crossings.
Uses eBPF/XDP to implement an in-kernel cache for Memcached that intercepts GET requests before the network stack, achieving significant throughput improvements by avoiding user-kernel transitions for cache hits.
Uses program synthesis to automatically generate correct and efficient eBPF packet-processing code from high-level specifications, overcoming the difficulty of manually writing verifier-compliant eBPF programs.
Enables user-space implementation of CPU scheduling policies by having the kernel delegate scheduling decisions to user-space agents, allowing rapid iteration on scheduling algorithms without kernel modifications.
Provides a unified framework for user-defined scheduling policies across CPU, network, and storage using eBPF, enabling application-specific cross-layer scheduling optimizations.
Embeds lightweight neural network inference within eBPF programs to enable ML-driven decisions (e.g., congestion control) directly in the kernel datapath with microsecond-scale latency.
Allows applications to push storage operations (e.g., B-tree lookups) into the kernel via eBPF, reducing I/O round-trips by chaining dependent reads within the NVMe driver.
Offloads performance-critical paths of distributed protocols (e.g., Paxos, chain replication) to eBPF in the kernel, reducing latency by avoiding user-space context switches on the critical path.
Uses eBPF to implement a high-performance database proxy that bypasses user-space for common-case query routing, falling back to user-space only for complex cases.
Proposes automatically identifying and offloading performance-critical application code fragments to eBPF in the kernel, treating eBPF as a general-purpose kernel acceleration substrate.
Analyzes security vulnerabilities in the eBPF ecosystem, demonstrating how malicious eBPF programs can exploit verifier weaknesses or timing side-channels despite the safety guarantees.
Implements distributed transaction coordination (2PC, OCC) in eBPF to minimize latency by keeping the critical path entirely in kernel space, bypassing user-space transaction managers.
Allows applications to define custom file prefetching policies via eBPF hooks in the Linux page cache, enabling workload-specific prefetching without kernel modifications.
Uses ECN marks from switches to achieve fine-grained congestion control in datacenters, reacting proportionally to the extent of congestion rather than treating any congestion signal as severe.
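The proportional reaction can be sketched as below, assuming the DCTCP-style update in which the sender keeps a running estimate `alpha` of the fraction of ECN-marked packets and scales its window cut by it. The function names and the default gain `g = 1/16` are assumptions made for illustration.

```python
def update_alpha(alpha: float, marked: int, total: int, g: float = 1 / 16) -> float:
    """Exponentially weighted average of the fraction of ECN-marked packets."""
    fraction = marked / total if total else 0.0
    return (1 - g) * alpha + g * fraction


def reduce_cwnd(cwnd: float, alpha: float) -> float:
    """Cut the window in proportion to the estimated extent of congestion.

    alpha == 1 (every packet marked) halves the window, like classic TCP;
    alpha near 0 (rare marks) barely reduces it.
    """
    return max(1.0, cwnd * (1 - alpha / 2))
```

With `alpha = 0.5`, a window of 100 packets shrinks to 75 rather than 50, which is the "react to the extent of congestion" behavior the entry describes.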
Achieves near-optimal flow completion times by decoupling flow scheduling from rate control: switches prioritize packets by remaining flow size, while senders transmit at line rate with minimal state.
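The switch-side half of this design, serving the packet whose flow has the least data remaining, can be approximated with a priority queue. This is a simplified sketch: `SrptQueue` is a hypothetical name, and buffer limits and drop policy are not modeled.

```python
import heapq
from itertools import count


class SrptQueue:
    """Output queue that always dequeues the packet with the smallest remaining flow size."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: FIFO among equal priorities

    def enqueue(self, remaining_bytes, packet):
        heapq.heappush(self._heap, (remaining_bytes, next(self._seq), packet))

    def dequeue(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Packets from nearly finished flows overtake packets from long flows, while packets with equal priority keep their arrival order.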
Uses precise RTT measurements (enabled by NIC hardware timestamps) as the primary congestion signal, avoiding the deployment complexity of ECN while achieving low latency in datacenters.
A UDP-based transport protocol with built-in encryption (TLS 1.3), 0-RTT connection establishment, multiplexed streams without head-of-line blocking, and connection migration across network changes.
Introduces ExpressPass, a credit-based congestion control where receivers explicitly schedule sender transmissions, providing bounded queuing delay and near-zero packet loss in datacenters.
Proposes NDP, combining per-packet multipath spraying with receiver-driven flow control and switch trimming (headers only on congestion) to achieve ultra-low latency and high throughput.
A connectionless, receiver-driven protocol that uses in-network priority queues to schedule packets by remaining message size (SRPT), achieving low tail latency for short messages.
Leverages in-network telemetry (INT) to obtain precise link utilization and queue information, enabling congestion control that converges quickly to high utilization with near-zero queuing.
A transport protocol designed specifically for RPCs, with request-level (not connection-level) load balancing and a policy-based scheduling abstraction for implementing various scheduling disciplines.
A delay-based congestion control for datacenters that uses NIC timestamps to measure one-way delays, separating fabric and endpoint congestion for targeted responses.
Provides a proactive transport primitive that pre-allocates bandwidth before data transmission, enabling deadline-aware scheduling and predictable latency for latency-sensitive traffic.
Uses a power function (throughput × delay) as the congestion signal, achieving both high throughput and low latency by balancing these competing objectives more effectively than prior approaches.
Argues that TCP's byte-stream abstraction and reliability semantics are mismatched for in-network computing (e.g., aggregation at switches), proposing a message-oriented transport with relaxed ordering.
Designs a transport protocol tailored for distributed DNN training traffic patterns (e.g., all-reduce), exploiting predictable communication patterns for better scheduling and reduced tail latency.
A message transport protocol enabling in-network compute operations (e.g., aggregation) by providing message-level reliability and allowing switches to process and transform messages in transit.
Argues that microservice architectures shift verification challenges from intra-application correctness to inter-service protocol and contract verification, requiring new tools for checking composition-level properties.
Presents DAGOR, WeChat's production overload control system that uses business-priority-based admission control at each service, with cooperative admission across the call graph to shed low-priority requests early and prevent cascading overload.
Shows that the optimal threading model (inline, synchronous, or asynchronous) for microservices varies with load and microservice characteristics, and proposes an online system that automatically selects and tunes the threading configuration to minimize tail latency.
Introduces DeathStarBench, an open-source suite of representative end-to-end microservice applications (social network, hotel reservation, etc.) for studying microservice performance, revealing that microservices have distinct hardware implications (e.g., deep call graphs amplify tail latency, high OS/network overhead).
Uses deep learning on distributed tracing data and hardware-level metrics to proactively predict QoS violations in microservice systems before they occur, and identifies the culprit microservice causing the violation.
Dynamically partitions shared hardware resources (cores, cache, memory bandwidth) among co-located latency-sensitive services using a gradient-descent-inspired controller that detects QoS violations and reallocates resources in near-real-time.
Offloads lightweight microservices (e.g., proxies, load balancers) to SmartNIC ARM cores, freeing host CPU for compute-intensive services while significantly reducing energy consumption per request.
Google's production autoscaler that uses ML (time-series forecasting) to recommend CPU and memory limits for jobs, reducing resource slack and out-of-memory events at scale.
Combines online telemetry with ML models to detect SLO violations, pinpoint the responsible microservice via resource-usage anomaly detection, and apply per-microservice resource adjustments (vertical/horizontal scaling, traffic routing) to restore SLO compliance.
A study of how microservices spend their CPU cycles. It shows that, within Facebook, microservices spend only a small fraction of their execution time serving core application logic, and significant cycles on orchestration work (e.g., compression, serialization, and I/O processing).
A serverless runtime optimized for microsecond-scale internal function calls in microservice applications, using shared-memory message channels and a concurrency-aware scheduler to minimize inter-function invocation overhead.
Applies causal Bayesian networks over per-microservice metrics to identify root causes of QoS violations, scaling to large deployments by decomposing the global dependency graph into per-service local models.
Uses an LSTM-based model to predict end-to-end latency from per-microservice resource allocations, then applies a reinforcement-learning agent to dynamically adjust per-service resources to meet SLOs while minimizing total resource usage.
Large-scale analysis of Alibaba's production microservice traces, revealing structural properties of dependency graphs (e.g., heavy fan-out, long call chains) and their impact on tail latency amplification.
Proposes workload-aware right-sizing of microservice containers using time-series prediction of resource demands, combined with bin-packing scheduling to reduce resource waste while meeting SLOs.
Introduces a systematic approach to fault injection testing at the service level (rather than individual API calls), automatically generating fault injection campaigns that explore how failures in one service propagate through the microservice dependency graph.
Estimates per-request resource consumption by analyzing request content (e.g., API parameters, payload size) using deep learning, enabling more accurate resource provisioning than load-agnostic approaches.
Uses deep learning to predict future workload and autoscale microservice replicas to maintain stable CPU utilization targets, deployed at scale in Alibaba Cloud to reduce resource waste from reactive scaling oscillations.
Combines workload prediction with a queuing-theory model to proactively determine the number of microservice replicas needed, avoiding the lag and oscillation of reactive autoscalers.
Provides a formally verified compilation framework (mu2sls) that automatically transforms microservice applications to run on serverless platforms while preserving exactly-once semantics and fault tolerance guarantees.
Proposes hindsight logging that retroactively captures detailed traces only when anomalies are detected, enabling root-cause analysis of rare edge cases without the overhead of always-on verbose tracing.
Uses lightweight probing and learned models to quickly detect and recover from QoS degradations caused by dynamic microservice behaviors (e.g., version updates, traffic shifts).
A large-scale empirical study of Meta's microservice topology and request workflows, revealing patterns such as highly skewed fanout distributions and long critical paths.
Meta's production service mesh that achieves low overhead by embedding routing logic into client libraries rather than sidecars, with centralized control for policy and load balancing.
Uses eBPF to capture network-level tracing data (TCP flows, latency) without application instrumentation, correlating network events with application traces for root-cause analysis.
A detailed performance analysis of service mesh sidecar proxies (e.g., Envoy), identifying sources of latency and CPU overhead and proposing optimizations.
Builds causal models from distributed traces to predict how changes (e.g., scaling, code optimizations) will affect end-to-end latency distributions across the microservice graph.
Language and system support for complex safety properties that reason about the flow of requests across the whole microservice network (not just between adjacent hops).
Proposes letting applications define their own network abstractions and policies within service meshes, rather than relying on fixed infrastructure-level primitives.
A general-purpose online resource allocator that uses Bayesian optimization to learn application performance models and allocate resources to meet diverse objectives (latency, throughput, fairness).
A compilation framework that separates microservice application logic from infrastructure concerns, enabling the same application code to be deployed across different backends (serverless, containers, etc.).
A caching framework that automatically identifies caching opportunities in microservice call graphs and places caches at optimal points to reduce redundant computation and latency.
Combines coarse-grained cluster-level scheduling with fine-grained per-service throttling to efficiently meet SLOs while maximizing resource utilization in microservice deployments.
Reconstructs distributed traces from network-level observations (packet timing, connection patterns) without requiring application-level instrumentation or trace ID propagation.
An overload control mechanism that sheds load at entry points based on downstream capacity signals, preventing overload from propagating deep into the microservice graph.
Eliminates per-pod sidecars by using per-node proxies with kernel-level traffic interception, reducing resource overhead while maintaining service mesh functionality at cloud scale.
A hardware-software co-designed resource manager that uses hardware performance counters and ML models to rapidly detect and mitigate SLA violations in dynamic microservice environments.
Proposes a two-tier policy architecture where expressive high-level policies (Copper) are compiled into fast data-plane rules (Wire), balancing policy expressiveness with enforcement performance.
Instead of load balancing to equalize load, dynamically shifts load between microservice replicas to exploit resource availability variations in shared clusters.
A decentralized overload control system where each microservice independently makes admission decisions while coordinating through request metadata to achieve system-wide overload protection.
Provides a high-level programming model for defining application-specific network policies and behaviors in service meshes, abstracting away low-level proxy configuration.
Provides a high-performance packet I/O framework that maps NIC rings directly into user space, eliminating per-packet system calls and achieving line-rate packet processing.
Provides predictable low latency for datacenter applications by using deadline-aware scheduling and admission control to bound tail latency under load.
Improves network performance on multicore systems by ensuring that connection processing happens on the same core that handles the application, reducing cache misses and cross-core communication.
A scalable network I/O API that batches system calls and partitions the listening socket across cores to eliminate contention, achieving significantly higher connection rates than BSD sockets.
A user-level TCP stack that achieves high scalability on multicore systems by eliminating kernel crossing overhead and using per-core data structures to avoid locking.
Argues for generating application-specific network stacks that include only the features needed, reducing code complexity and improving performance for specific workloads.
A dataplane OS that provides low-latency, high-throughput networking by separating the control plane (Linux) from a specialized, run-to-completion dataplane with zero-copy I/O.
Removes the OS from the I/O data path by using hardware virtualization (SR-IOV) to give applications direct access to network and storage devices, while the OS manages only resource allocation.
Enables low-latency networking while retaining the standard OS TCP/IP stack by using dedicated NICs with memory-mapped ring buffers, combining kernel stack compatibility with user-space performance.
Provides a framework for composable network stack extensions, allowing modules (e.g., WAN optimizers, traffic shapers) to be dynamically inserted into the stack without kernel modifications.
Extends hardware RSS to be aware of CPU load and connection state, dynamically redistributing flows to balance load while maintaining flow affinity for stateful processing.
Accelerates the TCP stack by splitting it into a "fast" data path (for data transport on established connections) and a control plane (for connection and context management, congestion control, etc.).
Google's user-space networking stack that runs as a microkernel-style service, enabling rapid iteration on networking features while providing isolation between applications and the network stack.
Provides a high-performance socket implementation that is fully compatible with existing applications by using RDMA for data transfer while maintaining the POSIX socket API.
A detailed measurement study breaking down where CPU cycles are spent in the Linux network stack, identifying key bottlenecks (e.g., memory allocation, locking) and quantifying their impact.
A hardware-software co-designed network stack that moves protocol processing into the NIC and wakes threads directly from network events, achieving sub-microsecond RPC latency.
Presents methodology and tools for pinpointing sources of latency at nanosecond granularity in complex end-host network stacks, using hardware timestamps and careful instrumentation.
Proposes managing RPC as an OS-level service that handles serialization, transport, and load balancing, decoupling applications from RPC implementation details.
Implements request cloning (hedged requests) for microsecond-scale RPCs at the NIC level, reducing tail latency by speculatively sending requests to multiple replicas without software overhead.
A measurement and analysis framework for understanding network performance of datacenter applications, correlating application-level metrics with network-level behavior.
A large-scale study of RPC characteristics at Google, revealing patterns in message sizes, call rates, and latency distributions that inform RPC system design.
Leverages CXL memory pooling to implement zero-copy RPC, where caller and callee share memory regions directly, eliminating serialization and network transfer overhead.
Introduces a cache-loader micro-benchmark that profiles application performance under varying cache-usage pressure, and uses the profile to predict the impact of cache interference among consolidated workloads.
Profiles each application twice: (1) with a memory antagonist, to obtain its memory sensitivity curve, and (2) to measure the memory pressure the application itself generates.
Profiles each NF's cache references per second when running alone and its performance-drop curve when co-located with a synthetic antagonist, then predicts the performance drop from these profiles.
Detects interference via differential low-level metrics (see Table 1), validates the interference and identifies the interfering resource by running the victim in isolation, and mitigates interference via migration.
Identifies that network latency long tails in public clouds stem from OS/virtualization-layer interference (not the network fabric) and proposes VM-level techniques (e.g., careful placement and outbound pacing) to mitigate tail latency for latency-sensitive cloud applications.
Uses collaborative filtering (similar to recommendation systems) to classify incoming workloads with minimal profiling and predict their sensitivity to interference and hardware heterogeneity, enabling QoS-aware placement without exhaustive benchmarking.
Uses cycles-per-instruction (CPI) as a metric to detect workload interference and identify perpetrators, then addresses the interference by throttling them. Key takeaway: CPI correlates with application performance and is a stable metric.
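A rough sketch of CPI-based detection, assuming a simple outlier rule: a task is flagged when its current CPI exceeds the historical mean by more than `k` standard deviations. The function names and the default threshold `k = 2.0` are illustrative assumptions, not values taken from this list.

```python
from statistics import mean, pstdev


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction over one sampling window."""
    return cycles / instructions


def is_victim(history, current, k=2.0):
    """Flag a sample whose CPI is an outlier relative to the task's own history."""
    mu = mean(history)
    sigma = pstdev(history)
    return current > mu + k * sigma
```

Because the threshold is derived from each task's own history, the same rule applies to workloads with very different baseline CPI.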
Co-location leads to increases in queuing delay, scheduling delay, and thread load imbalance. Addresses interference online via re-provisioning and scheduling.
Manages co-location of latency-critical (LC) and best-effort (BE) workloads via an online controller that monitors latency and resource usage and adjusts the isolation mechanisms for different resources.
Characterizes how performance isolation can break in virtualized network stacks in terms of network bandwidth and network-stack processing rate. Provides an abstraction and a construct based on bandwidth, latency, and loss rate to detect isolation breakdown and enforce isolation.
Achieves both high CPU efficiency and low tail latency by rapidly reallocating cores between applications at microsecond timescales using a centralized core arbiter.
Uses online telemetry data (resource usage and latency) and offline-learned models to detect and localize microservices that cause SLO violations, and mitigates violations via dynamic re-provisioning.
Clark's seminal paper identifying key design principles for protocol architecture, including the end-to-end argument, fate-sharing, and the importance of placing functionality at the right layer.
Proposes DONA, a clean-slate architecture that replaces DNS with flat, self-certifying names and in-network resolution, enabling data-centric rather than host-centric networking.
Introduces Content-Centric Networking (CCN/NDN), where content is addressed by name rather than location, with in-network caching and request aggregation as first-class primitives.
An expressive internet architecture that supports multiple principal types (hosts, services, content) with fallback paths, enabling incremental deployment of new network abstractions.
A network stack that introduces a service-level abstraction between transport and application layers, enabling service discovery, migration, and load balancing independently of IP addresses.
Argues for architectural pluralism: designing the Internet to support multiple co-existing architectures rather than a single universal design, with mechanisms for graceful evolution.
Automatically optimizes IPC between containers by detecting communication patterns and replacing socket-based IPC with shared memory when processes are co-located.
Reduces container startup time by lazily fetching image layers on-demand rather than downloading entire images upfront, exploiting the observation that containers use only a small fraction of their image data.
Analyzes production Docker registry workloads at IBM, identifying inefficiencies in layer deduplication and proposing optimizations for storage and distribution.
Enables slim containers by separating application binaries from debugging/development tools, which can be dynamically attached when needed without bloating the container image.
Addresses the problem of network processing consuming CPU cycles charged to the wrong container, providing accurate accounting and isolation of network-induced CPU usage.
Provides RDMA networking for containers by virtualizing RDMA in software, enabling container migration and multi-tenancy while preserving near-native RDMA performance.
Reduces container overlay network overhead by moving encapsulation and routing logic into the kernel, eliminating user-space proxy overhead while maintaining container network abstractions.
Identifies vulnerabilities in Linux cgroups that allow containers to escape resource limits, demonstrating attacks that consume unbounded CPU, memory, or I/O despite cgroup restrictions.
Addresses the overhead of creating network endpoints for short-lived serverless functions by pre-provisioning connection state and using lightweight endpoint assignment.
Improves container overlay network performance by parallelizing packet processing across multiple cores, addressing the bottleneck of single-threaded encapsulation/decapsulation.
Enables live migration of containers using RDMA by transparently checkpointing and restoring RDMA connection state, allowing memory-intensive applications to be migrated without modification.
Accelerates container provisioning at edge locations by using delta compression and optimized layer transfer protocols designed for high-latency WAN links.
Enables multiple containers to share GPUs transparently by intercepting CUDA calls and implementing fair scheduling and memory isolation without requiring application modifications.
Reduces container overlay network overhead by caching network state and forwarding decisions, minimizing per-packet processing while maintaining the flexibility of overlay networks.
About
This repository contains a list of papers on various topics (that I am working on or have worked on) in the systems and networking area.