About this role: Wells Fargo is seeking a Principal Engineer - Generative AI GPU Infrastructure Capabilities.

In this role, you will:
Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale, and require vision, creativity, innovation, and advanced analytical and inductive thinking
Translate advanced technology experience, an in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organization structure, and strategic technological opportunities and requirements into technical engineering solutions
Provide vision, direction, and expertise to leadership on implementing innovative and significant business solutions
Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership

Required Qualifications:
7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education

Desired Qualifications:
Design GPU cluster topologies (H100/H200, NVLink/NVSwitch), networking, and storage paths for high-throughput inferencing; document sizing and performance baselines.
Implement Run:ai constructs (Collections/Departments/Projects/workloads) for MDEV/MDEP/UCEP/MRM; codify quota, priority, and fair-share policies.
POC and benchmark disaggregated inferencing (prefill/decode) with vLLM / TensorRT-LLM; publish guidance for H100/H200 tuning (FP8/INT8/AWQ) and KV-transfer behavior over NVLink (a vLLM throughput-probe sketch follows this section).
Operationalize OpenShift AI parity for GPU scheduling, time-slicing/MIG profiles, and preemption; validate upgrade paths and helm/kustomize packaging.
Integrate Triton Inference Server for multi-model serving; standardize model repository structure, batching, dynamic shapes, and telemetry hooks (a Triton client sketch follows this section).
Harden NGDC environments with AVI/GSLB patterns (Prod1/Prod2) and BCP; execute DR failover runbooks and steady-state capacity planning.
Publish steady-state runbooks (deploy → certify → promote): DEV → UAT → MDEP-Beta → MDEP-GA / UCEP; define promotion criteria and risk exceptions.
Own endpoint productionization via Apigee (AI Gateway): authN/Z, rate limiting, API SLAs, versioning/deprecation, and SDK generation for internal consumers.
Embed observability/evaluations with Overwatch + Arize: prompt/agent/tool tracing, SLO dashboards, alerting, and data-retention/export workflows.
Automate CI/CD for infra and model artifacts: image scanning (JFrog remote repo), chart releases, canaries, and rollback plans across OCP/GKE.
Tune CUDA kernels/graph execution paths; profile NCCL collectives; resolve performance bottlenecks (HBM bandwidth, kernel fusion, p2p comms) (an NCCL bandwidth-probe sketch follows this section).
Qualify LLM/SLM runtimes (Gemma, Llama, GPT-OSS, etc.) with Run:ai scheduling; publish per-model recipes for throughput, latency, cost, and stability.
Define GPU estate hygiene: image provenance, secrets handling, namespace/network policy baselines, and change controls for upgrades (e.g., Run:ai v2.21+).
Partner with product/TPM/PO to align the backlog to platform milestones (OpenShift AI go-forward, SuperPOD activation waves, endpoint rollouts).
Mentor engineers; lead deep-dive reviews and present in exec/tech forums (CIO/ARB/offsites) with architecture readouts, performance data, and risk mitigations.
NVIDIA & CUDA: CUDA/cuDNN usage, NVLink/NVSwitch understanding, MIG setup, NCCL tuning, GPU profiling, H100/H200 optimization. Optimize kernels and collectives, choose MIG profiles, and validate interconnect bandwidth and NUMA/PCIe topology for LLM/SLM workloads.
LLM/SLM Runtimes: Work with vLLM, TensorRT-LLM, and Triton; apply FP8/INT4 quantization; tune KV-cache strategies. Build POCs for disaggregated prefill/decode, standardize Triton repos, and optimize batching.
Orchestration: Use Run:ai structures (Collections/Departments/Projects) and manage OCP/GKE environments. Implement GPU allocation patterns; enforce quotas, preemption, and fair-share scheduling.
OpenShift AI: Configure RHOAI GPU scheduling and time-slicing, use helm/kustomize, and validate upgrades. Achieve platform parity, certify charts and policies, and ensure admission controls function reliably.
API & Gateway: Apply Apigee authN/Z; manage quotas, rate limits, OpenAPI specs, SDK generation, and SLA operations. Productionize model endpoints, manage versioning and deprecation, and enforce gateway policies.
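For the vLLM benchmarking item above, a minimal throughput probe might look like the sketch below. It assumes a single-GPU H100/H200 node with vLLM installed; the model name, FP8 quantization flag, prompt set, and token budget are illustrative placeholders, not the recipes this role would actually publish.

```python
"""Minimal vLLM throughput probe (sketch; assumes vLLM on an H100/H200 node)."""
import time
from vllm import LLM, SamplingParams

# Model, quantization mode, and prompt set are placeholders for illustration only.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
          quantization="fp8",                         # FP8 quantization on Hopper-class GPUs
          tensor_parallel_size=1)

prompts = ["Summarize the key risks of serving LLM endpoints in production."] * 64
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} generated tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")
```

A published per-model recipe would extend this kind of probe to sweep batch sizes, sequence lengths, and quantization modes, recording latency percentiles alongside raw throughput.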
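For the Triton Inference Server integration, a standardized model repository can be exercised end to end with a client call along the lines of the sketch below; the endpoint URL, model name, and tensor names (input_ids, logits) are hypothetical and would come from each model's config.pbtxt in the shared repository.

```python
"""Sketch of a Triton Inference Server HTTP client call (names are illustrative)."""
import numpy as np
import tritonclient.http as httpclient

# Endpoint, model name, and tensor names are hypothetical; real values come from
# the model's config.pbtxt in the standardized model repository.
client = httpclient.InferenceServerClient(url="localhost:8000")

token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
request_input = httpclient.InferInput("input_ids", token_ids.shape, "INT64")
request_input.set_data_from_numpy(token_ids)
requested_output = httpclient.InferRequestedOutput("logits")

result = client.infer(model_name="example_llm",
                      inputs=[request_input],
                      outputs=[requested_output])
print(result.as_numpy("logits").shape)
```

Dynamic batching, shape constraints, and telemetry hooks are configured server-side in config.pbtxt, so client calls of this shape stay unchanged as serving policies evolve.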
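For the NCCL profiling and interconnect-validation items, a simple all-reduce bandwidth probe is often the first sanity check before deeper nsys or nccl-tests runs. The sketch below uses PyTorch's NCCL backend; the message size and iteration counts are arbitrary choices, and it assumes launch via torchrun on a multi-GPU node.

```python
"""All-reduce bandwidth probe (sketch). Launch: torchrun --nproc_per_node=<gpus> nccl_probe.py"""
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 1 GiB of fp16 per rank; vary the size to probe NVLink vs. PCIe paths.
payload = torch.ones(512 * 1024 * 1024, dtype=torch.float16, device="cuda")

for _ in range(5):  # warm-up so NCCL channels are established before timing
    dist.all_reduce(payload)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(payload)
torch.cuda.synchronize()
avg = (time.perf_counter() - start) / iters

# Ring all-reduce moves ~2*(n-1)/n of the payload per rank ("bus bandwidth").
world = dist.get_world_size()
moved = payload.element_size() * payload.numel() * 2 * (world - 1) / world
if dist.get_rank() == 0:
    print(f"avg all_reduce: {avg * 1e3:.2f} ms, bus bandwidth ~{moved / avg / 1e9:.1f} GB/s")
dist.destroy_process_group()
```

Results from probes like this feed the interconnect-bandwidth and NUMA/PCIe topology validation described in the NVIDIA & CUDA skills area above.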