AI Intelligent Computing Platform

The AI Intelligent Computing Platform is dedicated to creating a new model for building and operating computing power centers. It manages tens of thousands of GPUs under unified scheduling, applies optimization strategies that shorten delivery cycles, and enables on-demand usage. It supports large-scale parallel training scenarios, covers the entire AI computing workflow from model design and training through deployment and inference, and lets teams manage AI infrastructure as easily as local resources.

Product Features (Key Capabilities)

Diversified Computing Power Unified Scheduling
- Unified scheduling and management of multi-region, multi-type GPU computing power, general-purpose computing power, network resources (IB, RoCE), and storage resources (NVMe, parallel file storage, and object storage).
- Supports GPUDirect, NVLink, and related technologies to meet business needs across all scenarios, from model fine-tuning to large-scale training.
Computing Pool and Partitioning
- Supports the creation of shared and dedicated computing pools.
- Supports Multi-Instance GPU (MIG), passthrough (single or multiple GPUs), and vGPU solutions.
- On supported GPUs, memory and compute resources can be partitioned at fine granularity, ensuring effective isolation and security (see the sketch after this list).
- Easily meets demands for computing resources of different granularities.
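For illustration, here is a minimal sketch of how fine-grained GPU partitioning is commonly realized on MIG-capable NVIDIA GPUs, driving nvidia-smi from Python. The profile IDs are model-specific examples (19 = 1g.5gb on an A100-40GB); the platform's own partitioning mechanism is not specified in this document.

```python
import subprocess

def run(cmd: str) -> None:
    """Run a shell command, echoing it and stopping on failure."""
    print(f"$ {cmd}")
    subprocess.run(cmd.split(), check=True)

# Enable MIG mode on GPU 0 (requires a MIG-capable GPU such as A100/H100
# and root privileges; a GPU reset may be needed before it takes effect).
run("nvidia-smi -i 0 -mig 1")

# Create GPU instances plus their default compute instances (-C).
# Profile IDs vary by GPU model; on an A100-40GB, 19 = 1g.5gb.
# Seven 1g.5gb slices partition memory and compute into isolated sevenths.
run("nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C")

# List the resulting GPU instances to verify the partition layout.
run("nvidia-smi mig -i 0 -lgi")
```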
Intelligent, Simple Operation and Maintenance with Fine-Grained Operations
- A unified operation and maintenance management platform standardizes and visualizes resources for efficient operations.
- Helps administrators achieve fine-grained resource allocation and standardized operation of services across computing scenarios.
- Combined with multi-dimensional resource monitoring, it improves computing power utilization efficiency.
Intelligent Computing Power Scheduling Management
- Capable of distributed scheduling across tens of thousands of GPUs.
- Supports priority, reservation, pause, resume, resource sharing, and preemptive scheduling (a simplified sketch follows this list).
- Customizable placement-group policies automatically allocate and manage computing resources, significantly reducing task execution time.
- Computing resources are allocated on demand and switched automatically.
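As a rough illustration of priority-based preemptive scheduling, the toy sketch below preempts the lowest-priority running task when a higher-priority task cannot otherwise be placed. All names and the in-memory design are hypothetical assumptions; a real distributed scheduler would also checkpoint or pause preempted work.

```python
import heapq

class PreemptiveScheduler:
    """Toy single-pool scheduler: higher-priority tasks may preempt
    running ones when no free GPUs remain. A sketch only, not the
    platform's actual scheduler."""

    def __init__(self, total_gpus: int):
        self.free_gpus = total_gpus
        self.waiting = []  # min-heap of (-priority, name, gpus)
        self.running = []  # min-heap of (priority, name, gpus)

    def submit(self, name: str, priority: int, gpus: int) -> None:
        heapq.heappush(self.waiting, (-priority, name, gpus))
        self._dispatch()

    def finish(self, name: str) -> None:
        for i, (_, task, gpus) in enumerate(self.running):
            if task == name:
                self.running.pop(i)
                heapq.heapify(self.running)
                self.free_gpus += gpus
                break
        self._dispatch()

    def _dispatch(self) -> None:
        while self.waiting:
            neg_prio, name, gpus = self.waiting[0]
            # Preempt the lowest-priority running task while a strictly
            # higher-priority waiting task is still short of GPUs.
            while (self.free_gpus < gpus and self.running
                   and self.running[0][0] < -neg_prio):
                prio, victim, vgpus = heapq.heappop(self.running)
                self.free_gpus += vgpus
                # A real scheduler would pause/checkpoint the victim here.
                heapq.heappush(self.waiting, (-prio, victim, vgpus))
            if self.free_gpus < gpus:
                return  # head-of-queue task must wait; keep priority order
            heapq.heappop(self.waiting)
            self.free_gpus -= gpus
            heapq.heappush(self.running, (-neg_prio, name, gpus))

sched = PreemptiveScheduler(total_gpus=8)
sched.submit("fine-tune", priority=1, gpus=8)
sched.submit("prod-train", priority=9, gpus=8)  # preempts "fine-tune"
print(sched.running[0][1])  # -> prod-train
```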
One-stop AI Computing Service
- Provides services including algorithm code development, model training, model fine-tuning, model management, and model deployment and inference.
- Offers support such as image repositories, function libraries, drivers, deep learning frameworks, datasets, etc.
- Integrates solutions from multiple vendors to help customers implement AI applications end to end.
Self-Service Resource Usage
- Provides users with self-service capabilities, including account creation, account top-up, resource application, online activation, resource usage, and resource release.
- Resources can be supplied per task as shared GPUs billed by card-hours, or as dedicated GPU resources, improving user experience and resource supply efficiency (a billing sketch follows this list).
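The card-hour accounting mentioned above reduces to a simple calculation. The sketch below shows it; the per-card-hour rate is an assumed illustrative parameter, not a published price.

```python
from datetime import datetime, timedelta

def card_hours(num_gpus: int, start: datetime, end: datetime) -> float:
    """Card-hours = GPU cards occupied x wall-clock hours of occupation."""
    hours = (end - start).total_seconds() / 3600.0
    return num_gpus * hours

def task_cost(num_gpus: int, start: datetime, end: datetime,
              rate_per_card_hour: float) -> float:
    # The rate is an assumed illustrative parameter.
    return card_hours(num_gpus, start, end) * rate_per_card_hour

start = datetime(2024, 1, 1, 9, 0)
end = start + timedelta(hours=6, minutes=30)
print(card_hours(4, start, end))      # 4 cards x 6.5 h = 26.0 card-hours
print(task_cost(4, start, end, 2.0))  # 52.0 at an assumed 2.0 per card-hour
```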

Resource Analysis and Partitioning

Partitioning Principles
- Partitioned by GPU type
- Partitioned by the computing services provided by the platform
Special Instructions
- A compute node can belong to more than one resource group
- Virtualized resources are kept separate from whole-machine resources to avoid fragmenting machines
01 Shared Resource Pool
- Users submit computing tasks to the shared resource pool, occupy GPU cards for the task duration, release them on completion, and are billed by card-hours.
02 Development Resource Pool
- Users submit development tasks to the development resource pool, occupy GPU cards for the task duration, release them on completion, and are billed by card-hours.
03 Inference Resource Pool
- Users deploy inference services; resources are billed on demand.
04 Dedicated Resource Pool
- The platform provides users with exclusive container nodes. Users can apply for dedicated single-card or multi-card instances; the system reserves these resources, billed monthly.
05 GPU Virtualization
- Provides virtualized instances on selected GPU cards, such as 1/4-card instances; supports configurable partitioning rules.
06 Just-in-Time Use
- Computing resources for a single task are allocated as close together in the network topology as possible, following a nearest-switch placement principle (see the sketch after this list).
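A minimal sketch of the nearest-switch placement idea: prefer filling a request from nodes under a single leaf switch so task traffic stays one hop away. The node/switch data layout is a hypothetical assumption, and the cross-switch fallback is omitted for brevity.

```python
from collections import defaultdict

def place_task(nodes: dict, gpus_needed: int):
    """Pick nodes for one task, preferring a single leaf switch.

    `nodes` maps node name -> {"switch": leaf switch id, "free": free GPUs}.
    A sketch of nearest-switch placement, not the platform's algorithm.
    """
    by_switch = defaultdict(list)
    for name, info in nodes.items():
        if info["free"] > 0:
            by_switch[info["switch"]].append((name, info["free"]))

    # Try switches in ascending order of total free GPUs: the tightest
    # single switch that fits keeps all traffic one hop away and
    # reduces fragmentation of larger switches.
    for switch, members in sorted(by_switch.items(),
                                  key=lambda kv: sum(f for _, f in kv[1])):
        if sum(free for _, free in members) >= gpus_needed:
            chosen, remaining = [], gpus_needed
            for name, free in sorted(members, key=lambda m: -m[1]):
                take = min(free, remaining)
                chosen.append(name)
                remaining -= take
                if remaining == 0:
                    return chosen
    # Otherwise the request must span switches (omitted here).
    return None

nodes = {
    "n1": {"switch": "leaf-a", "free": 4},
    "n2": {"switch": "leaf-a", "free": 4},
    "n3": {"switch": "leaf-b", "free": 8},
}
print(place_task(nodes, 8))  # fits under a single leaf switch
```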

Scheduling Principles

- Tasks can be queued by priority (see the admission sketch below)
- Resources can be reserved per resource group
- Quotas restrict each tenant's usage
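A compact sketch of how these three rules can combine into a single admission check; the tenant names, quotas, and reservation figures are illustrative assumptions.

```python
def admit(task_gpus: int, tenant: str, usage: dict, quota: dict,
          pool_free: int, reserved_for_others: int) -> bool:
    """Admission check combining quota limits, resource-group
    reservations, and available capacity (a simplified sketch)."""
    # Quota restriction: a tenant may not exceed its configured GPU quota.
    if usage.get(tenant, 0) + task_gpus > quota.get(tenant, 0):
        return False
    # Reservation: capacity reserved for other resource groups is untouchable.
    usable = pool_free - reserved_for_others
    return task_gpus <= usable

usage = {"team-a": 6}
quota = {"team-a": 8, "team-b": 16}
print(admit(2, "team-a", usage, quota, pool_free=10, reserved_for_others=4))  # True
print(admit(4, "team-a", usage, quota, pool_free=10, reserved_for_others=4))  # False: over quota
```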

Multiple Storage Services

Provides comprehensive distributed storage products to rapidly build high-performance storage services.
Parallel File Storage EPFS
- Access bandwidth of up to hundreds of GB/s
- Meets the demands of hundreds to thousands of nodes accessing data simultaneously

A high-speed storage system for MPI-IO-level, large-scale, multi-node parallel reads and writes (a minimal mpi4py sketch follows the list below).
① Integrated into the platform, providing a public storage service to users.
② Billed based on capacity.
③ Tenant-level security isolation, supporting multiple network types.
④ Provides a visual interface for uploading and downloading files.
⑤ Automatically associated with the cluster by default, eliminating manual configuration.
⑥ Expands or shrinks capacity on demand.
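To make the MPI-IO claim concrete, here is a minimal mpi4py sketch in which every rank writes its own slice of one shared file with a collective write, the access pattern a parallel file system is built to serve. The /mnt/epfs mount path is a hypothetical example; launch the script under mpirun with several processes.

```python
import numpy as np
from mpi4py import MPI  # requires an MPI runtime; launch with mpirun/mpiexec

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank writes its own contiguous slice of one shared file.
chunk = np.full(1024, rank, dtype=np.float32)
offset = rank * chunk.nbytes

# "/mnt/epfs/demo.bin" is a hypothetical path on a parallel-FS mount.
fh = MPI.File.Open(comm, "/mnt/epfs/demo.bin",
                   MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Write_at_all(offset, chunk)  # collective write: all ranks participate
fh.Close()

if rank == 0:
    print("collective write complete")
```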

Object Storage
- Builds a unified data storage foundation
- A data lifecycle tiering strategy reduces storage costs (a configuration sketch follows the list below)

① Storing models
② Storing datasets
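As one concrete way to express a lifecycle tiering strategy against an S3-compatible object store, the boto3 sketch below tiers cold datasets and expires stale checkpoints. The endpoint URL, bucket name, prefixes, and storage-class names are illustrative placeholders; compatible stores may name tiers differently.

```python
import boto3

# Assumes an S3-compatible endpoint; URL and bucket are placeholders.
s3 = boto3.client("s3", endpoint_url="https://object-store.example.com")

s3.put_bucket_lifecycle_configuration(
    Bucket="ai-platform-data",
    LifecycleConfiguration={
        "Rules": [
            {
                # Tier cold training datasets to cheaper storage after 30 days.
                "ID": "tier-cold-datasets",
                "Status": "Enabled",
                "Filter": {"Prefix": "datasets/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            },
            {
                # Expire stale intermediate checkpoints after 90 days.
                "ID": "expire-checkpoints",
                "Status": "Enabled",
                "Filter": {"Prefix": "checkpoints/tmp/"},
                "Expiration": {"Days": 90},
            },
        ]
    },
)
```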

File Storage NFS
- In inference scenarios, provides a standard application interface for multi-machine, multi-GPU computing
- Consistent reads and writes across machines, with high aggregate throughput

① General-purpose file storage

Model Repository

01 GPU scheduling
02 One-click delivery of the complete environment (see the sketch after this list)
03 Model warehouse service
04 Integration with high-performance storage
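A sketch of what one-click environment delivery can look like using the Docker SDK for Python: pull a framework image from an image repository and start a GPU container with a high-performance storage mount attached. The registry URL, image tag, and mount path are assumed placeholders, not the platform's actual API.

```python
import docker  # Docker SDK for Python; assumes a reachable Docker daemon

IMAGE = "registry.example.com/ai/pytorch:2.3-cuda12"  # hypothetical image

client = docker.from_env()

# Step 1: pull the framework image from the image repository.
client.images.pull(IMAGE)

# Step 2: start a container with all GPUs attached and the (hypothetical)
# high-performance storage mount bound read-only for model access.
container = client.containers.run(
    IMAGE,
    command='python -c "import torch; print(torch.cuda.is_available())"',
    device_requests=[docker.types.DeviceRequest(count=-1,
                                                capabilities=[["gpu"]])],
    volumes={"/mnt/epfs/models": {"bind": "/models", "mode": "ro"}},
    detach=True,
)
print(container.id)
```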

Reference: Network Deployment Architecture Diagram

Get in touch with us today! We look forward to collaborating with you.

You can contact our sales team via the website for a consultation. We'll help you
understand how our solutions can be tailored to meet your specific needs.