The recommended **Infrastructure Cluster** is built with scalable [**Intel® Gaudi® AI accelerators**](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html#gaudi-architecture) and standard servers. [Intel® Xeon® processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon/xeon6-product-brief.html) power both the Gaudi servers, which serve as worker nodes, and the standard servers, which serve as highly available control plane nodes. This infrastructure is designed for **high availability**, **scalability**, and **efficiency** in **Retrieval-Augmented Generation (RAG) and other Large Language Model (LLM) inferencing** workloads.
The [**Gaudi embedded RDMA over Converged Ethernet (RoCE) network**](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Scaling_Guide/Theory_of_Distributed_Training.html#theory-of-distributed-training), along with the [**3 Ply Gaudi RoCE Network topology**](https://docs.habana.ai/en/latest/Management_and_Monitoring/Network_Configuration/Configure_E2E_Test_in_L3.html#generating-a-gaudinet-json-example), supports high-throughput, low-latency LLM parallel pre-training and post-training workloads, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). For more details, see: [Training and fine-tuning LLM Models with Intel Enterprise AI Foundation on OpenShift](https://github.com/intel/intel-technology-enabling-for-openshift/wiki/Fine-tunning-LLM-Models-with-Intel-Enterprise-AI-Foundation-on-OpenShift)
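For scale-out training over the L3 RoCE fabric, the Habana guide linked above describes generating a `gaudinet.json` file that tells each Gaudi RoCE port how to reach its gateway. A minimal sketch of the documented shape is below; all MAC and IP values are placeholders, and the actual entries (one per Gaudi RoCE port) depend on your fabric layout:

```json
{
  "NIC_NET_CONFIG": [
    {
      "NIC_MAC": "00:11:22:33:44:50",
      "NIC_IP": "192.168.100.1",
      "SUBNET_MASK": "255.255.255.0",
      "GATEWAY_MAC": "00:aa:bb:cc:dd:01"
    },
    {
      "NIC_MAC": "00:11:22:33:44:51",
      "NIC_IP": "192.168.101.1",
      "SUBNET_MASK": "255.255.255.0",
      "GATEWAY_MAC": "00:aa:bb:cc:dd:02"
    }
  ]
}
```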
This highly efficient infrastructure has been validated with cutting-edge enterprise AI workloads on the production-ready OpenShift platform, enabling users to easily evaluate and integrate it into their own AI environments.
Additionally, Intel Software Guard Extensions (SGX), Data Streaming Accelerator (DSA), and QuickAssist Technology (QAT) accelerators (available with Xeon processors) are supported to further enhance performance and security for AI workloads.
For more details, see: [Supported Red Hat OpenShift Container Platform (RHOCP) Infrastructure](https://github.com/intel/intel-technology-enabling-for-openshift/blob/main/docs/supported_platforms.md#supported-intel-hardware-features)
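Once the corresponding device plugins (managed by the operators described in the next section) are deployed, these Xeon features surface as extended node resources that pods request through ordinary resource limits. Below is a minimal sketch, assuming the resource names advertised by the Intel SGX device plugin; the image and quantities are placeholders:

```yaml
# Sketch: a pod requesting SGX enclave page cache (EPC) memory.
# Resource names assume the Intel SGX device plugin is deployed;
# the image is a placeholder for an SGX-enabled workload.
apiVersion: v1
kind: Pod
metadata:
  name: sgx-demo
spec:
  restartPolicy: Never
  containers:
  - name: enclave-app
    image: registry.example.com/sgx-app:latest   # placeholder image
    resources:
      limits:
        sgx.intel.com/epc: "512Ki"   # EPC memory, accounted in bytes
        sgx.intel.com/enclave: 1     # access to the SGX enclave device
```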
## AI Accelerators & Network Provisioning
Provisioning AI accelerators and networks on a scalable OpenShift/Kubernetes cluster while ensuring the manageability of AI infrastructure and platforms presents significant challenges. To address this, the [general Operator concept](https://github.com/intel/intel-technology-enabling-for-openshift/wiki/Intel-Technology-Enabling-for-OpenShift-Architecture-and-Working-Scope#architecture-options) has been proposed and implemented in this project.
[OpenShift/Kubernetes Operators](https://www.redhat.com/en/technologies/cloud-computing/openshift/what-are-openshift-operators) automate the management of the software stack, streamlining AI infrastructure provisioning. Rather than relying on a single monolithic operator to handle the entire stack, this project follows the [operator best practice](https://sdk.operatorframework.io/docs/best-practices/best-practices/) of **"do one thing and do it well"**. This industry-leading approach significantly simplifies both operator development and the AI provisioning process.
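As a concrete illustration of this modular pattern, each operator is installed independently through the Operator Lifecycle Manager (OLM). A minimal sketch of a `Subscription` for the Intel Device Plugins Operator follows; the package name, channel, and catalog source shown here are assumptions and should be confirmed against the Red Hat catalog entry linked in the list below:

```yaml
# Sketch: subscribe to the Intel Device Plugins Operator via OLM.
# Package, channel, and source names are assumptions -- confirm them
# against the catalog entry before applying.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: intel-device-plugins-operator
  namespace: openshift-operators
spec:
  name: intel-device-plugins-operator     # OLM package name (assumed)
  channel: alpha                          # release channel (assumed)
  source: certified-operators             # Red Hat certified operator catalog
  sourceNamespace: openshift-marketplace
```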
* [**Intel® Network Operator**](https://github.com/intel/network-operator) automates the configuration of RDMA NICs used with Intel AI accelerators and simplifies their use.
* [**Intel® Device Plugins Operator**](https://catalog.redhat.com/software/container-stacks/detail/61e9f2d7b9cdd99018fc5736) handles the deployment and lifecycle of the device plugins that advertise Intel AI accelerators and other hardware feature resources to OpenShift/Kubernetes (see the sketch after this list).
* [**Kernel Module Management (KMM) Operator**](https://github.com/rh-ecosystem-edge/kernel-module-management) manages the deployment and lifecycle of out-of-tree kernel modules, such as the Intel® Data Center GPU Driver for OpenShift.
* [**Machine Config Operator (MCO)**](https://github.com/openshift/machine-config-operator) provides a unified interface through which the other general operators configure the operating system running on the OpenShift nodes.
* [**Node Feature Discovery (NFD)**](https://docs.redhat.com/en/documentation/openshift_container_platform/4.18/html/specialized_hardware_and_driver_enablement/psap-node-feature-discovery-operator) Operator detects and labels AI hardware features and system configurations; these labels are then consumed by the other general operators.
* [**Intel® Converged AI Operator**]() will, in the future, provide a stable, single entry point that simplifies using the general operators to provision Intel AI features.
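Together, these operators make accelerators consumable through standard Kubernetes scheduling. As referenced from the device plugins item above, the sketch below requests one Gaudi card; the `habana.ai/gaudi` resource name and the NFD PCI label are assumptions to verify on your cluster (for example, with `oc describe node`):

```yaml
# Sketch: schedule a workload onto a Gaudi node and claim one card.
# Resource and label names are assumptions -- verify them on your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: gaudi-demo
spec:
  restartPolicy: Never
  nodeSelector:
    # NFD PCI label, assumed default form: pci-<class>_<vendor>.present
    # (class 1200 = processing accelerators, vendor 1da3 = Habana)
    feature.node.kubernetes.io/pci-1200_1da3.present: "true"
  containers:
  - name: trainer
    image: registry.example.com/gaudi-app:latest   # placeholder image
    resources:
      limits:
        habana.ai/gaudi: 1   # one Gaudi card, advertised by the device plugin
```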
Other general operators can be added in the future to extend the supported AI features.
## Releases and Supported Platforms
Intel Enterprise AI Foundation for OpenShift is released in alignment with the OpenShift release cadence. It is recommended to use the latest release.