DataScale SN40L rack System Administration

Copyright © 2020-2023 by SambaNova Systems, Inc. All contents are subject to a licensing agreement with SambaNova Systems, Inc. Any disclosure, reproduction, distribution, reverse engineering, or any other use made without the advance written permission of SambaNova Systems, Inc. is unauthorized and strictly prohibited. All rights of ownership and enforcement are reserved.

1. Get started with DataScale SN40L rack administration

This SambaNova DataScale® hardware administration document targets the SN40L version of the SambaNova DataScale rack.

This page gets you started:

  • Learn about SambaNova support, SambaNova documentation, and other resources.

  • Get an overview of the DataScale hardware and software stacks.

See the DataScale hardware installation documentation for details on hardware installation requirements and tasks.

1.1. SambaNova support

SambaNova customers that have valid support contracts can contact support and obtain product support documentation through the SambaNova support portal at https://support.sambanova.ai.

1.2. SambaNova documentation

As part of hardware installation, you might need SambaNova documentation, SambaNova KBs, and third-party documentation.

1.3. Third-party documentation

For operational issues with the third-party components in the DataScale SN40L rack, see the following vendor-specific product documentation. If you need additional support or have troubleshooting questions, open a support case through SambaNova Support. See KB article #1017, "SambaNova Systems Support Best Practices," at https://support.sambanova.ai.

Do not open a support case with the product vendor.

1.4. Overview of DataScale SN40L rack hardware

The DataScale SN40L is self-contained in a standard 42 rack unit (RU) datacenter rack. Different configurations are available for purchase, depending on customer requirements (including data center requirements). System population begins at the bottom of the rack with node 1 and increments up the rack. Network switches and other equipment are installed at the top of the rack.

A DataScale SN40L rack system consists of:

  • SN40L-2 modules. Four DataScale SN40L-2 RDU modules. Each DataScale SN40L-2 module contains two Reconfigurable Dataflow Units™ (RDUs), for a total of eight RDUs per DataScale SN40L rack system. The RDUs are managed by the SambaFlow software stack running on the host.

  • SN40L-H host module. An x86-based DataScale SN40L-H host module running either Red Hat® Enterprise Linux® or Ubuntu® Linux.

Both the DataScale SN40L-2 RDU module and the DataScale SN40L-H module are 2RU chassis.

Switch equipment at the top of the rack provides a data network and an access network by default. The following image and table identify the main components in the DataScale SN40L rack.

Figure 1. DataScale SN40L rack components (front view)
Table 1. DataScale SN40L rack components
No.  Component
1    System 1 SN40L-8 (SN40L-H)
2    System 1 SN40L-8 (four SN40L-2)
3    System 2 SN40L-8 (SN40L-H)
4    System 2 SN40L-8 (four SN40L-2)
5    Juniper® QFX5130 Ethernet (fan-side)
6    Lantronix® serial console server (Juniper EX series switch behind)
7    Juniper EX-series Ethernet switch (fan-side)
8    Lantronix® serial console server

1.5. SambaNova DataScale software stack

The software stack consists of the following components:

  • Host module OS. At the bottom of the stack is the host module OS, either RHEL or Ubuntu.

  • SambaFlow. SambaFlow™ is a software stack that runs on SambaNova systems. This stack includes:

    • SambaFlow Runtime. Responsible for communication with the DataScale hardware including hardware initialization, error handling, resource management, and interfacing with userspace processes requesting hardware resources.

    • Compilers. Proprietary compilers make your models available to the DataScale hardware.

    • SambaFlow Python SDK. Developers use the SDK to create and run models.

The SambaFlow software is installed and executed on the SN40L-H host modules.

The following documentation describes the software stack, model development, and deployment:

1.5.1. DataScale SN40L host module OS

The DataScale SN40L-H host module in each system supports the following OS (operating system) versions:

  • Red Hat Enterprise Linux 8.5

  • Ubuntu Server 20.04.2 Long-Term Support (LTS)

Both images are preinstalled on each SN40L-H host module.

SambaNova provides updates for the OS images and updates for the software components through a repository. DataScale SN40L-H host module connectivity to the SambaNova repository is set up as part of the DataScale SN40L rack installation and relies on the site survey that your company completed. As part of the initial installation, SambaNova provides a sambanova.repo file that contains credentials and paths to your specific repository.

See KB article #1057 for details.

1.5.2. How to identify the SambaFlow software version (RHEL)

The command you run to identify the version of the SambaFlow software packages that are installed on the DataScale SN40L-H host modules depends on the OS that is running on the module.

To identify the software version on RHEL, run this command:

# dnf list installed | grep samba[nf]

The command results in output that starts like the following (the exact output depends on the SambaFlow version you are using):

sambaflow.x86_64                            1.12.7-15.el8
sambaflow-apps-datascale-image-unet.x86_64  1.12.7-15.el8
sambaflow-apps-starters-logreg.x86_64       1.12.7-15.el8
sambaflow-cpp.x86_64                        1.12.7-15.el8
sambaflow-deps-capnproto.x86_64             0.8.0-1.el8
sambaflow-deps-isl.x86_64                   0.22-1.el8
sambaflow-deps-pillow-simd.x86_64           7.2.0.post1-1.el8
sambaflow-deps-venv.x86_64                  1.12.4-2.el8
sambaflow-exec.x86_64                       1.12.7-15.el8
sambaflow-tools-llvm11.x86_64               11.0.0-3.rc1.el8
...

1.5.3. How to identify the SambaFlow software version (Ubuntu)

To identify the software version on Ubuntu Linux, run this command:

# apt list --installed | grep samba[nf]

The command results in output that starts like the following:

sambaflow-apps-datascale-language-transformers/focal,focal,now 1.13.0-2207251206 amd64
sambaflow-apps-starters-logreg/focal,focal,now 1.13.0-2207251206 amd64
sambaflow-cpp/focal,now 1.12.4-2203291247 amd64
sambaflow-deps-capnproto/focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,now 0.8.0-1 amd64
sambaflow-deps-pillow-simd/focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,focal,now 7.2.0.post1-1 amd64
sambaflow-exec/focal,focal,now 1.13.0-2207251206 amd64
...
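
Because the listing command depends on the OS, a script can pick it from the ID field in /etc/os-release. The following is a minimal sketch, not a SambaNova tool: the function names sambaflow_list_cmd and detect_os_id are hypothetical, and the sketch assumes that ID is rhel on Red Hat Enterprise Linux and ubuntu on Ubuntu.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of SambaFlow.

# Print the package-listing command for a given os-release ID.
sambaflow_list_cmd() {
    case "$1" in
        rhel)   echo "dnf list installed" ;;
        ubuntu) echo "apt list --installed" ;;
        *)      echo "unsupported OS: $1" >&2; return 1 ;;
    esac
}

# Print the ID field from an os-release file (default: /etc/os-release).
detect_os_id() {
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "${1:-/etc/os-release}" && printf '%s\n' "$ID" )
}
```

On a host module, the two helpers could be combined as `$(sambaflow_list_cmd "$(detect_os_id)") | grep 'samba[nf]'`.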

1.6. Default username and passwords for components

The following table shows several components in the DataScale SN40L rack that have default passwords for users with administrative/root credentials. See Network administration for information on changing passwords for switches.

SambaNova highly recommends that you change these default passwords as soon as possible.
Do not use a slash character in a password for an XRDU. Both forward slash (/) and backward slash (\) can cause problems.
Table 2. Default usernames and passwords
Component                                            Username  Default password
Lantronix serial console server                      sysadmin  Changeme
Juniper QFX5130 high-bandwidth Ethernet data switch  root      Changeme
Juniper EX series access switch                      root      Changeme
DataScale SN40L-2/XRDU BMC                           root      1Changeme
DataScale SN40L-H BMC                                admin     Changeme (NOTE: Password must not exceed 14 characters.)
DataScale SN40L-H OS                                 root      Changeme
DataScale SN40L-H OS                                 snuser1   Changeme
Vertiv™ PDU                                          admin     Changeme

By default, the operating system on the SN40L-H is configured with a user, snuser1, that has superuser privileges (that is, the user can run sudo commands). The post-install test of the system uses this user to run example applications. For security reasons, SambaNova recommends that you delete this user after the test is completed. You can then create your own users or configure the system to use a company-wide LDAP server.
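
Before you set a new password, you can check a candidate against the two constraints noted above: no slash characters for an XRDU, and no more than 14 characters for the SN40L-H BMC. The following is a minimal sketch; check_bmc_password is a hypothetical helper name, and the sketch checks only these two rules, not any other password policy the devices might enforce.

```shell
#!/bin/sh
# check_bmc_password is a hypothetical helper, not a SambaNova tool.
# Usage: check_bmc_password <xrdu|host-bmc> <candidate-password>
check_bmc_password() {
    kind=$1
    pw=$2
    case "$kind" in
        xrdu)
            # XRDU passwords must not contain a forward slash or a backslash.
            case "$pw" in
                */*|*\\*)
                    printf '%s\n' 'XRDU password must not contain / or \' >&2
                    return 1
                    ;;
            esac
            ;;
        host-bmc)
            # SN40L-H BMC passwords must not exceed 14 characters.
            if [ "${#pw}" -gt 14 ]; then
                echo "SN40L-H BMC password exceeds 14 characters" >&2
                return 1
            fi
            ;;
    esac
    return 0
}
```

The function returns 0 when the candidate passes the check for the given component and 1 otherwise.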

2. Network administration

This page has information about network administration for the DataScale® SN40L rack.

  • Pointers to third-party documents for the network devices.

  • Instructions for changing passwords for the network devices.

  • Examples for the DataScale SN40L rack IP address assignments for the management, access, and data networks, as described in the DataScale hardware installation. The actual IP addresses depend on the subnets and host IP addresses in the Pre-Delivery Site Survey document that your company provided before delivery and installation of the DataScale SN40L rack.

In a single-node DataScale deployment, an amber light appears on port 16 of the QFX5130. This is expected behavior for this switch.

2.1. Network device administration

Most users do not configure the serial console server and the Juniper access switch. This topic discusses only tasks that you’re likely to perform and includes sample IP addresses. For more information:

2.1.1. Change default passwords for switches

SambaNova highly recommends that you change the default passwords for all components at first login. See Default Passwords.
Change password for Juniper EX series access switch and QFX5130 data switch
  1. Run the following commands:

    $ ssh root@<Juniper_switch_IP_address>
    root@:RE:0% cli
    root> configure
    root# set system root-authentication plain-text-password
    root# commit
  2. Log out of the switch by running the exit command three times (to exit configuration mode, operational mode, and the Linux CLI).

  3. Log back in with the new password.

Change password for Lantronix SLC8000 serial console server:

Run the following command:

$ ssh sysadmin@<Lantronix_switch_IP_address>
> set localusers password sysadmin

2.1.2. Patch releases for network devices

SambaNova provides a periodic patch release for these network devices. You can download these patches from the SambaNova ext-infra-patch repository. See KB article #1062 "Listing and downloading available SN40L rack firmware" for details.

Patch release notes explain any steps that differ from the standard steps described in the specific product administration documentation.

2.2. IP address assignments for the access and management network

The management and access networks share the same 1GbE switch. Depending on customer requirements, they can be on the same network or on two separate networks separated by VLAN. The example IP addresses in the table below assume that the customer chose to merge the access and management networks into the same network.

Table 3 shows examples for the access and management network IP address assignments for components such as the BMC (baseboard management controller), the switch equipment, and the PDUs in the DataScale SN40L rack.

The information in the Example IP address (10.0.1.0/24) column assumes a customer who provided a 10.0.1.0/24 subnet. The IP address range starts at .16 in the last octet because some IPs are reserved for SambaNova usage. The addresses include placeholders for customer networking infrastructure like gateway IP.

Table 3. Access network IP address assignments
Example IP address (10.0.1.0/24)  Component                         System #
10.0.1.1-4                        Reserved for customer infra       -
10.0.1.5-15                       Reserved for SambaNova            -
10.0.1.16                         Serial console server             -
10.0.1.17                         Access/Mgmt switch                -
10.0.1.18                         Data switch                       -
10.0.1.19                         PDU 1                             -
10.0.1.20                         PDU 2                             -
10.0.1.21                         PDU 3                             -
10.0.1.22                         PDU 4                             -
10.0.1.23                         SN40L-H-1 OS (eth0)               System 1
10.0.1.24                         SN40L-H-1 BMC                     System 1
10.0.1.25                         SN40L-H-1-XRDU0 BMC               System 1
10.0.1.26                         SN40L-H-1-XRDU1 BMC               System 1
10.0.1.27                         SN40L-H-1-XRDU2 BMC               System 1
10.0.1.28                         SN40L-H-1-XRDU3 BMC               System 1
10.0.1.29                         SN40L-H-2 OS (eth0)               System 2
10.0.1.30                         SN40L-H-2 BMC                     System 2
10.0.1.31                         SN40L-H-2-XRDU0 BMC               System 2
10.0.1.32                         SN40L-H-2-XRDU1 BMC               System 2
10.0.1.33                         SN40L-H-2-XRDU2 BMC               System 2
10.0.1.34                         SN40L-H-2-XRDU3 BMC               System 2
x.x.x.255                         Broadcast IP address for network  -
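
In the example table, each system occupies six consecutive addresses starting at .23: the host OS, the host BMC, and then the four XRDU BMCs. The following minimal sketch reproduces that pattern. The helper name access_ip is hypothetical, and the offsets describe only this example. The actual addresses at your site come from the Pre-Delivery Site Survey.

```shell
#!/bin/sh
# access_ip is a hypothetical helper that mirrors the example table only.
# Usage: access_ip <first-three-octets> <system-number> <role>
# Roles: os, bmc, xrdu0, xrdu1, xrdu2, xrdu3
access_ip() {
    prefix=$1
    system=$2
    case "$3" in
        os)    slot=0 ;;
        bmc)   slot=1 ;;
        xrdu0) slot=2 ;;
        xrdu1) slot=3 ;;
        xrdu2) slot=4 ;;
        xrdu3) slot=5 ;;
        *) echo "unknown role: $3" >&2; return 1 ;;
    esac
    # System 1 starts at .23; each system uses six consecutive addresses.
    echo "$prefix.$(( 23 + 6 * (system - 1) + slot ))"
}
```

For example, `access_ip 10.0.1 2 xrdu3` prints 10.0.1.34, which matches the last XRDU BMC row for system 2.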

2.3. IP address assignments for the data network

Table 4 shows examples for the high-bandwidth data network IP address assignments for the compute components in the DataScale SN40L rack.

The example IP addresses shown in the Example IP address (10.0.2.0/27) column assume a customer who provided a 10.0.2.0/27 subnet.

Table 4. Data network IP address assignments
Example IP address (10.0.2.0/27)  Component                         System #
x.x.x.1-4                         Reserved for customer infra       -
10.0.2.5                          SN40L-H-1 snhni0                  System 1
10.0.2.6                          SN40L-H-2 snhni0                  System 2
10.0.2.31                         Broadcast IP address for network  -
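
A /27 subnet spans 32 addresses, so its broadcast address is any address in the subnet with the low five bits set. For the example host addresses in Table 4 (10.0.2.5 and 10.0.2.6), that gives a broadcast address ending in .31, as the table shows. The following minimal sketch does the arithmetic for the last octet. The helper name broadcast_octet is hypothetical, and the sketch covers only prefix lengths from /24 to /32.

```shell
#!/bin/sh
# broadcast_octet is a hypothetical helper for prefix lengths /24 through /32.
# Usage: broadcast_octet <last-octet-of-any-address-in-the-subnet> <prefix-length>
broadcast_octet() {
    octet=$1
    prefix=$2
    host_bits=$(( 32 - prefix ))
    # Mask of the low bits that vary within the subnet (31 for a /27).
    mask=$(( (1 << host_bits) - 1 ))
    echo $(( octet | mask ))
}
```

For example, `broadcast_octet 5 27` prints 31.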

3. DataScale SN40L power management

For proper operation of the DataScale® SN40L rack and to prevent issues, be sure you power on and power off the system appropriately and in the correct sequence, as described on this page.

3.1. Warnings and general notes

The following notices apply to the DataScale SN40L rack.

Some components within the rack work at high voltage. To prevent personal injury and voiding of the warranty, do not attempt to service components except where noted.
To protect the DataScale SN40L rack from interference and to prevent damage to its components, keep the front and rear rack doors closed during standard operation.
To prevent DataScale SN40L rack components from overheating, keep the front and rear of the rack clear of obstructions to allow proper airflow.
Do not power off or reboot the DataScale SN40L rack components during any firmware update procedure. Doing so might damage the DataScale SN40L rack components, and damaged components might not be recoverable. Perform a shutdown or reboot only after a firmware update has been completed.
When the PDUs are first connected to the datacenter’s power receptacles, power is not immediately applied to the DataScale SN40L rack components because the breakers on the PDUs are turned off. You must manually turn on these breakers to begin feeding power to the rack components. After power is applied, all DataScale SN40L rack components begin to power on. The fans of these components initially run at full speed but ramp down after the BMCs finish their boot sequence.

3.2. Process overview

To avoid damage to the system, perform the power-on procedure or a graceful shutdown in the correct order.

  • To turn on the DataScale SN40L rack, follow the detailed steps below. Here’s an overview:

    1. Power on the DataScale SN40L rack by turning on the circuit breakers for each PDU.

    2. Boot the DataScale SN40L-2 RDU modules.

    3. Boot the DataScale SN40L-H host module.

  • To gracefully shut down the DataScale SN40L rack, follow the detailed steps in Gracefully shutting down the DataScale SN40L rack. Here’s an overview:

    1. Shut down the SN40L-H host modules.

    2. Shut down the DataScale SN40L-2 RDU modules.

3.3. Power on the DataScale SN40L rack

Power on the DataScale SN40L-2 RDU modules before you power on the DataScale SN40L-H host modules, as described in the following steps.
  1. Turn on the six circuit breakers for each PDU.

    When the PDUs are plugged into the datacenter power and you close the circuit breakers, power is automatically applied to the DataScale SN40L rack components. Figure 2 shows what a PDU circuit breaker group looks like, with breaker switch 6 circled. Each PDU has a bank of three circuit breakers grouped together.

    Figure 2. Circuit breakers on PDU

    The DataScale SN40L-H host modules and DataScale SN40L-2 RDU modules boot into standby mode and wait to be manually powered on. The BMC/service processors are powered on through these devices. The networking equipment in the rack does not go into standby mode; instead, it completely boots when power is established.

    SambaNova uses networking equipment from other suppliers. See Third-party documentation.

3.4. Boot the DataScale SN40L-2 modules

Boot the DataScale SN40L-2 RDU modules by using SSH to connect to the SN40L-2 BMC, by sending a REST API call to the SN40L-2 BMC, or by pressing the power button on each module. This section includes steps for all three options.

3.4.1. Option 1: Use SSH to connect to the SN40L-2 BMC

  1. From a system that has access to the DataScale SN40L rack access network, open a terminal session and use ssh to securely connect to the first DataScale SN40L-2 RDU module in each system.

    See the IP address assignment information in Network administration or use your customer-specific IP assignment worksheet to get the IP address to connect to. The first DataScale SN40L-2 RDU module in each system is as follows:

    System 1: SN40L-2-1 (SN40L-H-1-XRDU0)

    System 2: SN40L-2-5 (SN40L-H-2-XRDU0)

    Here’s an example for system 1 that assumes IP address subnet 10.0.1.0/24 for the access network:

    $ ssh root@10.0.1.25
    root@10.0.1.25's password: <Enter root password>
    root@xrdu:~#
  2. Run the following xrduutil command to power on the system:

    root@xrdu:~# xrduutil -U root -P <root_password> poweron
  3. To ensure the DataScale SN40L-2 RDU modules are up before you boot the DataScale SN40L-H host module, check the status of each module by running this command:

    root@xrdu:~# xrduutil -U root -P <root_password> powerstate
    Power is on for XRDU_0
    Power is on for XRDU_1
    Power is on for XRDU_2
    Power is on for XRDU_3
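
If you script the power-on sequence, you can check the powerstate output before you move on to the host module. The following is a minimal sketch that assumes the output format shown above, with one "Power is on for XRDU_n" line per module. The helper name all_xrdus_on is hypothetical.

```shell
#!/bin/sh
# all_xrdus_on is a hypothetical helper, not part of xrduutil.
# It reads `xrduutil ... powerstate` output on stdin and succeeds only if the
# expected number of modules (default: 4) report that power is on.
# Usage: xrduutil -U root -P <root_password> powerstate | all_xrdus_on [count]
all_xrdus_on() {
    expected=${1:-4}
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    on_count=$(grep -c '^[[:space:]]*Power is on for XRDU_' || true)
    [ "$on_count" -eq "$expected" ]
}
```

The function returns 0 when all expected modules report that power is on, so a calling script can wait and retry until it succeeds.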

3.4.2. Option 2: Send a REST API call to the SN40L-2 BMC

  1. Generate a token (recommended). If you use the REST API, SambaNova recommends that you use token-based authentication so that plain-text passwords are not sent over the network for REST API commands. See Generate a secure API login token for details.

  2. Run the REST API power-on command for each DataScale SN40L-2 RDU module in each of the nodes, in no particular order.

    Format:

    $ curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d '{"data":"xyz.openbmc_project.State.Chassis.Transition.On"}' https://<SN40L-2_BMC_IP>/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition

    Example:

    $ curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d '{"data":"xyz.openbmc_project.State.Chassis.Transition.On"}' https://10.0.1.25/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition
  3. To ensure the DataScale SN40L-2 RDU modules are up before you boot the SN40L-H, run the following command against each of the DataScale SN40L-2 RDU modules:

    Format:

    $ curl -b cjar -k -H "X-Auth-Token: $token" https://<SN40L-2_BMC_IP>/xyz/openbmc_project/state/chassis0

    Example:

    $ curl -b cjar -k -H "X-Auth-Token: $token" https://10.0.1.25/xyz/openbmc_project/state/chassis0

    After an SN40L-2 RDU module is powered on, the output looks similar to the following:

    {
      "data": {
        "CurrentPowerState": "xyz.openbmc_project.State.Chassis.PowerState.On",
        "LastStateChangeTime": 1591197275103,
        "POHCounter": 75,
        "RequestedPowerTransition": "xyz.openbmc_project.State.Chassis.Transition.On"
      },
      "message": "200 OK",
      "status": "ok"
    }
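
In a script, the field of interest in this response is CurrentPowerState. The following minimal sketch extracts the final component of that value (for example, On) with sed, so it does not depend on a JSON tool being installed. The helper name chassis_power_state is hypothetical, and the sketch assumes the response format shown above.

```shell
#!/bin/sh
# chassis_power_state is a hypothetical helper, not part of the BMC software.
# It reads the JSON response on stdin and prints the last dot-separated
# component of the CurrentPowerState value.
# Usage: curl ... https://<SN40L-2_BMC_IP>/xyz/openbmc_project/state/chassis0 | chassis_power_state
chassis_power_state() {
    sed -n 's/.*"CurrentPowerState": *"[^"]*\.\([A-Za-z]*\)".*/\1/p'
}
```

If jq is available on your system, `jq -r .data.CurrentPowerState` prints the full value instead.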

3.4.3. Option 3: Mechanical power-on

To power on the SN40L-2 modules:

  1. Press and hold the power button on the front panel of the SN40L-2 for 5 seconds. This panel is located on the front left side of the system. The power button is identified as item 1 in Figure 3.

  2. Wait for the system LED (callout item 2) to change from a blinking to a solid green light.

    Figure 3. SN40L front panel (annotated)
  3. When the system LED is no longer blinking, the SN40L-2 module is powered on. The power-on process can take up to a minute.

  4. Repeat the process for each SN40L-2 module in the SN40L-8 node.

3.5. Power on the DataScale SN40L-H host module

To ensure that the DataScale SN40L-H host module populates the system device tree properly, power on the host module only after the DataScale SN40L-2 RDU modules are powered on fully.

You can boot the DataScale SN40L-H host module using one of these options:

3.5.1. Option 1: Mechanical power on

To power on the SN40L-H host module, press the power button located on the front panel of the SN40L-H. This panel is located on the front left side of the server.

3.5.2. Option 2: Power on via IPMI

Run the following command from a system that has ipmitool installed and that has access to the SN40L-H host module’s BMC via the access network.

$ ipmitool -I lanplus -H <SN40L-H_BMC_IP_Address> -U root -P <root password> power on
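
To confirm the result from a script, you can query the power status through the same interface. The following minimal sketch assumes that `ipmitool ... power status` prints a line such as "Chassis Power is on". The helper name host_power_is_on is hypothetical. The commented example uses the ipmitool -E option, which reads the password from the IPMI_PASSWORD environment variable instead of the command line.

```shell
#!/bin/sh
# host_power_is_on is a hypothetical helper, not part of ipmitool.
# It reads `ipmitool ... power status` output on stdin and succeeds
# only if the chassis is reported as powered on.
host_power_is_on() {
    grep -qi '^Chassis Power is on'
}

# Example call (not run here; placeholders as in the command above):
#   export IPMI_PASSWORD='<root password>'
#   ipmitool -I lanplus -H <SN40L-H_BMC_IP_Address> -U root -E power status | host_power_is_on
```

The function returns 0 when the host module is powered on and a non-zero status otherwise.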

3.5.3. Option 3: Power on via WebUI

To power on via WebUI your system must meet the following requirements:

  • Access to the DataScale SN40L-H host module’s BMC via the access network

  • One of the following supported web browsers:

    • Chrome (latest version)

    • Firefox (latest version)

Follow these steps:

  1. Open a web browser and enter the IP address of the DataScale SN40L-H host module’s BMC in the address bar.

  2. Log in to the management console by entering the user credentials and click Sign me in.

  3. Select Power Control from the BMC dashboard.

  4. Select the Power On checkbox, and then click Perform Action.

  5. Perform this boot sequence for all nodes in the DataScale SN40L rack. The order in which you bring up the nodes does not matter.

3.6. Gracefully shutting down the DataScale SN40L rack

You can shut down the DataScale SN40L rack but not completely power off the entire rack. Follow these steps for each node in the rack.

3.6.1. Shut down the SN40L-H host modules

Shut down the SN40L-H host module in each system by using one of the following methods:

Option 1: Shut down from the OS

Log in to the node via ssh as snuser1 and initiate a shutdown command.

$ ssh snuser1@<SN40L-H_OS_IP_Address>
snuser1@SN40L-H1's password: <password>
$ sudo shutdown

This command does not shut down the system immediately but waits about a minute for users to save their work.

Option 2: Power off via IPMI
  1. Ensure that your system has:

    • Access to the SN40L-H host module’s BMC via the access network

    • The ipmitool installed

  2. Run the following command:

$ ipmitool -I lanplus -H <SN40L-H_BMC_IP_Address> -U root -P <root password> power off

Option 3: Power off via WebUI

To power off via WebUI, your system must meet the following requirements:

  • Access to the DataScale SN40L-H host module’s BMC via the access network

  • One of the following supported web browsers:

    • Chrome (latest version)

    • Firefox (latest version)

Follow these steps:

  1. Open a web browser and enter the IP address of the DataScale SN40L-H host module’s BMC in the address bar.

  2. Log in to the management console with your user credentials and click Sign me in.

  3. Select Power Control from the BMC dashboard.

  4. In the Power Actions screen, select the Power Off checkbox and click Perform Action.

3.6.2. Shut down the DataScale SN40L-2 RDU modules

Shut down the DataScale SN40L-2 RDU modules in the node using SSH or a REST API call, as follows:

Option 1: Use SSH to connect to the DataScale SN40L-2 BMC
  1. Open a terminal session from a system that has access to the DataScale SN40L rack access network.

  2. Use ssh to connect to the first DataScale SN40L-2 in each node.

    To get the IP address to connect to, see the IP address assignment information in Network administration or use your customer-specific IP assignment worksheet. The first DataScale SN40L-2 RDU module in each system is as follows:

    System 1: SN40L-2-1 (SN40L-H-1-XRDU0)

    System 2: SN40L-2-5 (SN40L-H-2-XRDU0)

    Example for system 1 given IP address subnet 10.0.1.0/24 for the access network:

    $ ssh root@10.0.1.25
    root@10.0.1.25's password: <Enter root password>
    root@xrdu:~#
  3. Run the xrduutil poweroff command:

    root@xrdu:~# xrduutil -U root -P <root_password> poweroff

Option 2: Send a REST API call to the DataScale SN40L-2 BMC

You can perform the shutdown using the REST API power-off command.

SambaNova recommends that you use token-based authentication so that you do not send plain-text passwords over the network when you use REST commands. See Generate a secure API login token.
  1. Run the REST API power-off command for each of the DataScale SN40L-2 RDU modules in each of the systems.

    Format:

    $ curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d '{"data":"xyz.openbmc_project.State.Chassis.Transition.Off"}' https://<SN40L-2_BMC_IP>/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition

    Example:

    $ curl -b cjar -k -H "X-Auth-Token: $token" -X PUT -d '{"data":"xyz.openbmc_project.State.Chassis.Transition.Off"}' https://10.0.1.25/xyz/openbmc_project/state/chassis0/attr/RequestedPowerTransition
  2. Shut down the Juniper QFX5130 high-bandwidth data switch, the Lantronix SLC8000 serial console server, and the Juniper EX series access switch.

    Shut down the Juniper EX series access switch last when you power down the entire DataScale SN40L rack. That switch controls the final access to the system via the network.

    See the product-specific documentation listed under Third-party documentation for details on how to shut down each of these switches.

After you shut down the switches, you can no longer access the PDUs over the network to cycle outlets because their network switch is down. To cycle power, you must manually switch the relevant breakers off and back on at the physical PDU.

4. Host module OS administration

Administrative tasks differ depending on which supported OS you are running on each of the SN40L-H host modules.

4.1. Supported versions of the SN40L-H operating systems

The SN40L-H host module supports the following OS versions:

  • Red Hat Enterprise Linux 8.5

  • Ubuntu Server 20.04.2 Long-Term Support (LTS)

4.2. General notes and warnings

Some third-party software and OS packages may prevent the SambaFlow™ software stack from functioning properly. In this case, SambaNova Support may require that you remove all non-certified third-party software and non-certified packages or package versions to return the DataScale® SN40L-H host module to a satisfactory state before work continues on any support issues.
DataScale SN40L-H host modules are configured with a default login password for users root and snuser1. SambaNova strongly recommends that you change these passwords immediately after logging in to a DataScale SN40L-H host module.
SambaNova strongly recommends that you do not perform a major upgrade or a kernel update to the DataScale SN40L-H host module OS without referring to the supported OS, kernel, and package versions noted within this document because the SambaNova software relies on strict package dependencies. SambaNova recommends that you do not perform any major updates unless you are directed to do so by SambaNova.
Before you perform Linux package updates, ensure there are no package dependencies that might break the SambaFlow software if the packages are not at the correct level.

4.3. Licensing

SambaNova provides the package repositories for Red Hat Enterprise Linux and for Ubuntu running on the DataScale SN40L rack.

  • SambaNova has a partnership with Red Hat that allows SambaNova to distribute a customized repository for the DataScale SN40L rack.

  • SambaNova has a partnership with Ubuntu that allows SambaNova to distribute a customized repository for the DataScale SN40L rack.

Adding other repositories can cause issues with the operation of the SambaFlow software because of some package and kernel version dependencies.

If the SambaNova software stack has problems running, SambaNova Support might request that you remove any packages that were not originally included from your Linux repository or that you downgrade certain packages to a version that was certified.

4.4. Login process

To access the DataScale SN40L-H host module for the first time:

  1. Find a system that can access the DataScale SN40L rack access network. The access network might be combined with the management or data network.

  2. Use ssh as user snuser1 to log in to the DataScale SN40L-H host module.

  3. Enter the default password for snuser1 when prompted. See Default username and passwords for components.

$ ssh snuser1@<SN40L-H_OS_IP_Address>
snuser1@<SN40L-H_OS_IP_Address>'s password: <Default Password>

SambaNova strongly recommends that you change the default password for root and snuser1. To change the snuser1 password, run the following command and enter the new password when prompted:

$ passwd
Changing password for snuser1.
(current) UNIX password: <Current_Default_Password>
Enter new UNIX password: <New_Secure_Password>
Retype new UNIX password: <New_Secure_Password>
passwd: password updated successfully

4.5. Connect to the SambaNova OS repository

DataScale SN40L-H host module connectivity to the SambaNova repository is set up as part of the DataScale SN40L rack installation and relies on the site survey that your company completed. As part of the initial installation, SambaNova provides a sambanova.repo file that contains the credentials and paths to your specific repository.

If you need to check the setup for the SambaNova OS repository, see KB article #1057.

4.6. OS repository configuration file

Do not modify the sambanova.repo repository file. Doing so can break SambaFlow software package dependencies, which might cause unrecoverable package dependency issues. You might have to rebuild the SN40L-H host module as a result. If you need any packages that are not provided by SambaNova, open a support case with SambaNova Support.

4.7. Updating the DataScale SN40L-H host module OS

SambaNova patch releases handle major upgrades to the DataScale SN40L-H host module OS, for example:

  • Going from RHEL 8.5 to RHEL 8.6 or later

  • Going from Ubuntu 20.04 LTS to Ubuntu 22.04 LTS

  • Kernel updates.

4.8. Updating the SambaFlow software

To update the SambaFlow software packages, log in to the DataScale SN40L-H host module(s) where the software packages need to be updated. The commands you run depend on the OS you’re using.

4.8.1. Update SambaFlow on RHEL

To view what packages are installed on the DataScale SN40L-H host module, run the following command:

$ dnf list installed | grep samba[nf]

To view which SambaFlow packages have an update that you can apply, run the following command:

$ dnf check-update | grep samba[nf]

To update the SambaFlow packages, examine the check-update command output, and then run the following command to update a package and any package dependencies:

$ sudo dnf update <package_name>

For example, if the output produced by the check-update command shows that an update is available for the sambaflow package, run the following command:

$ sudo dnf update sambaflow

Repeat this step for each package that needs to be updated. Due to package dependencies, updating one package might update several other packages.
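
When several packages have updates, it can help to reduce the check-update output to a plain list of package names. The following minimal sketch assumes the usual three-column output, in which the first column is the package name followed by its architecture. The helper name samba_updates_dnf is hypothetical. Note that dnf check-update exits with status 100 when updates are available, so a script should not treat that status as a failure.

```shell
#!/bin/sh
# samba_updates_dnf is a hypothetical helper, not part of dnf or SambaFlow.
# It reads `dnf check-update` output on stdin and prints the names of packages
# that start with "sambaf" or "samban", without the architecture suffix.
samba_updates_dnf() {
    awk '$1 ~ /^samba[nf]/ { sub(/\.[^.]*$/, "", $1); print $1 }'
}

# Example call (not run here): print one update command per package for review.
#   dnf -q check-update | samba_updates_dnf | while read -r pkg; do
#       echo "sudo dnf update $pkg"
#   done
```

The example call only prints the update commands, so you can examine the list before you run any of them.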

4.8.2. Update SambaFlow on Ubuntu

To view what packages are installed on the DataScale SN40L-H host module, run the following command:

$ dpkg -l | grep samba[nf]

To view which SambaNova packages have an update you can apply, run the following command:

$ apt list --upgradable | grep samba[nf]

To update all the packages that need to be updated, run the following command, which updates the packages and any package dependencies:

$ sudo apt install --only-upgrade samba[nf]

To update a specific package, replace samba[nf] with the name of a specific package. For example, to update sambaflow, run the following command:

$ sudo apt install --only-upgrade sambaflow
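
To review updates one package at a time, you can reduce the upgradable list to package names first. The following minimal sketch assumes that each line of `apt list --upgradable` output starts with the package name followed by a slash. The helper name samba_updates_apt is hypothetical. Because apt warns that its command-line output is not a stable interface for scripts, treat this as an illustration only.

```shell
#!/bin/sh
# samba_updates_apt is a hypothetical helper, not part of apt or SambaFlow.
# It reads `apt list --upgradable` output on stdin and prints the names of
# packages that start with "sambaf" or "samban".
samba_updates_apt() {
    awk -F/ '$1 ~ /^samba[nf]/ { print $1 }'
}

# Example call (not run here): print one upgrade command per package for review.
#   apt list --upgradable 2>/dev/null | samba_updates_apt | while read -r pkg; do
#       echo "sudo apt install --only-upgrade $pkg"
#   done
```

As with the RHEL sketch, the example call only prints the commands so that you can check them before you run them.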

5. BMC administration

When security patches are available or when BMC firmware updates are required for other reasons, you can perform the tasks in this section. Updating the BIOS is included with this BMC administration topic because the two tasks are usually performed at the same time. The tasks include:

  • Updating the DataScale® SN40L-H host module BMC firmware

  • Updating the DataScale SN40L-H host module BIOS

  • Recovering the DataScale SN40L-H BMC

See View SN40L-H BMC diagnostic information and logs for information on diagnostics.

5.1. General notes and warnings

Do not remove the admin user account or change this account’s password. This account is needed for password recovery of the DataScale SN40L-H host module’s BMC.
Do not power off or reboot the DataScale SN40L rack components during firmware updates. Interrupting a firmware update can damage the DataScale SN40L rack components. The damaged component might not be recoverable. Perform a shutdown or reboot only after a firmware update has been completed successfully.
Settings on the BMCs do not need modification and remain static unless you are updating the BMCs, collecting diagnostic material, or changing the login credentials. Do not make configuration changes to the BMC unless you are instructed to do so.

5.2. Updating the DataScale SN40L-H host module BMC firmware

If you start the firmware update process and then cancel it, you must reset the BMC. To do that, close the web browser that was logged in to the BMC WebUI, and then log in to the BMC WebUI again before you attempt any administrative operations for the BMC.

5.2.1. Back up the existing configuration

Before you update the firmware, back up the existing configuration of the DataScale SN40L-H host module. Having a backup might help with recovering the BMC.

To back up the existing configuration, your system must meet the following requirements:

  • Access to the DataScale SN40L-H host module’s BMC via the access network

  • One of the following supported web browsers:

    • Chrome (latest version)

    • Firefox (latest version)

Follow these steps to back up the existing configuration:

  1. Open a web browser and enter the IP address of the DataScale SN40L-H host module’s BMC in the address bar.

  2. Log in to the management console with your user credentials, and click Sign me in.

  3. In the left pane of the dashboard, select Maintenance.

  4. On the Maintenance screen, select Backup Configuration.

    Maintenance screen

  5. On the Backup Configuration screen, select Check All to back up all the BMC configuration details.

    Backup Configuration screen

  6. Click Download to save this configuration to the local system (which is accessing the BMC WebUI).

  7. Click OK to download the bmc-config.bak backup configuration file. You can use that file later if a restore is required.

5.2.2. Update the host module BMC firmware

After you back up the BMC configuration, you can update the SN40L-H host module’s BMC firmware while preserving the configuration. Follow these steps:

  1. Download the DataScale SN40L-H host module’s BMC patch update from the SambaNova Support portal to the local system that is accessing the BMC WebUI.

  2. Unzip the SambaNova patch update to a directory on the local system.

  3. On the Backup Configuration screen, select Maintenance in the left pane.

  4. On the Maintenance screen, select Preserve Configuration.

    Maintenance screen

  5. In the Preserve Configuration screen, select Check All at the top of the list to preserve the configuration of everything.

    The following message appears if the configuration preservation was successful.

    Success message

  6. In the left pane, click Maintenance and select Firmware Update in the Maintenance screen.

  7. Find the rom.ima_enc file:

    1. In the Firmware Update screen, click Browse.

    2. Navigate to the patch bundle that you downloaded and unzipped. The firmware file is located in the /SN40L rack/<version>/HostBMC_FW/ directory of the unzipped patch bundle.

    3. Select the rom.ima_enc file and click Open.

  8. Back in the Firmware Update screen, click Start firmware update.

    Firmware Update screen

  9. Below the button that you just clicked, select the Preserve all Configuration checkbox to use the preserved configuration you saved.

    Preserve all Configuration

  10. Scroll to the bottom of the screen and click Proceed to Flash.

    Proceed to Flash

  11. Click OK in the BMC update confirmation screen.

    When the BMC update process has started, the BMC is not reachable for 5 to 10 minutes while the update is being applied. The DataScale SN40L-H host module OS continues to run normally during the BMC update.

  12. After 10 minutes, log in to the BMC WebUI again and check the information in the upper left to confirm that the update was successful. The BMC firmware version is identified as <XX.XX.X>.

    BMC firmware version

5.3. Update the DataScale SN40L-H host module BIOS

Ensure the update process is not interrupted! When you enter the update mode, all open widgets are closed automatically and other web pages and services no longer work. If you cancel the upgrade in the middle of the process, the SN40L-H host module will be reset only for the BMC BOOT and APP components of the firmware.
The SN40L-H host module BIOS update requires a reboot of the system to apply the updated BIOS. Plan accordingly.

To update the SN40L-H host module BIOS, your system must meet the following requirements:

  • Access to the DataScale SN40L-H host module’s BMC via the access network

  • One of the following supported web browsers:

    • Chrome (latest version)

    • Firefox (latest version)

Follow these steps to perform the update:

  1. Open a web browser and enter the IP address of the SN40L-H host module’s BMC in the browser’s address bar.

  2. Enter your user credentials, and click Sign me in.

  3. In the dashboard, select Maintenance.

  4. In the Maintenance screen, select Firmware Update.

    Maintenance screen

  5. Find the image.RBU file:

    1. In the Firmware Update screen, click Browse.

      Firmware Update screen

    2. Navigate to the /Host_BIOS/RBU/ directory of the uncompressed infrastructure patch bundle.

    3. Select the image.RBU file and click Open.

  6. Back in the Firmware Update screen, click Start firmware update.

    Firmware Update screen

  7. Below the Start firmware update button, select BIOS from the Update Type drop-down.

    Update Type drop-down

  8. Click Proceed to Flash and click OK.

    This initiates uploading the BIOS firmware update to the DataScale SN40L-H host module, but it does not automatically apply the firmware update.

  9. When the screen shows Uploading 100%, click Flash BIOS to initiate the BIOS update process.

    Flash BIOS button

  10. When the flash process is complete, a “firmware image has been updated successfully” message appears. Click OK to continue.

  11. A "Firmware reset has been called" message appears. Click OK to log out of the SN40L-H BMC WebUI and follow the steps in Section 5.3.1 to reset the host module OS.

5.3.1. Reset the host module OS

As a final step, you have to reset the host module OS.

  1. After you are logged out of the SN40L-H BMC, log in to the SN40L-H OS.

    $ ssh snuser1@<SN40L-H_OS_IP_Address>
    snuser1@<SN40L-H_OS_IP_Address>’s password: <snuser1 Password>
  2. From the command line, reset the SN40L-H OS to complete the BIOS update.

    $ sudo shutdown -r now
    [sudo] password for snuser1: <snuser1 Password>
  3. When the SN40L-H host module is back online, confirm that the BIOS update has been applied, as follows:

    1. Log in to the SN40L-H BMC and select Maintenance from the left pane of the dashboard.

    2. In the Maintenance screen, select Firmware Information.

      Maintenance screen

    3. Check the BMCFirmware Information section and the BIOS Firmware Information to confirm that the upgrade was successful.

      BMCFirmware Information screen

5.4. Recover the DataScale SN40L-H BMC

If the DataScale SN40L-H host module’s BMC is no longer responding or no longer accessible, or the DataScale SN40L-H host module’s BMC password has been lost or forgotten, see Backing up and restoring components.

6. DataScale SN40L RDU module administration

Administrative tasks for the DataScale® SN40L-2 RDU module include the following:

  • Changing the root password

  • Generating a secure API login token for authentication

  • Updating the DataScale SN40L-2 BMC and RDU controller (RDU-C) firmware

  • Configuring the DataScale SN40L-2 BMC network

  • Configuring the DataScale SN40L-2 BMC hostname

There is a built-in secure account on the DataScale SN40L-2 BMC called snservice. It is used for password recovery of root if the password is forgotten. For more details on this account, refer to KB article #1049.

6.1. Change the root password

SambaNova highly recommends that you change the default password for root to a more secure password.
Passwords cannot be based on dictionary words and cannot include the # character. If you use a dictionary word, a BAD PASSWORD message results, and the password is not changed.

To change the default password for root on the DataScale SN40L-2 BMC, follow these steps:

  1. Log in to the DataScale SN40L-2 BMC:

    $ ssh root@<SN40L-2_BMC_IP_Address>
    Password: <Enter root password>
  2. Run the passwd command and enter a new password, as follows:

    root@xrdu:~# passwd
    New password: <New Password>
    Retype new password: <New Password>
    passwd: password updated successfully
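Before you run passwd, you can pre-check a candidate password against the one rule above that is easy to test mechanically: the # character is not allowed. This is a minimal sketch with a hypothetical function name, check_bmc_password. It does not perform the dictionary check, which is still done by passwd on the BMC.

```shell
# Hypothetical pre-check: reject an empty password or one that contains '#'.
# The dictionary-word check is NOT covered here; passwd on the BMC does that.
check_bmc_password() {
    case "$1" in
        '')    echo "rejected: empty password"; return 1 ;;
        *'#'*) echo "rejected: contains the # character"; return 1 ;;
    esac
    echo "ok"
}

# Usage:
#   check_bmc_password "$candidate" && echo "safe to try with passwd"
```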

6.2. Generate a secure API login token

You can generate a secure token for the DataScale SN40L-2 BMC root user to prevent the need to use plain-text passwords in REST API calls.

  1. Log in to the client system from which you want to run the REST API calls. The system must have network access to the DataScale SN40L-2 BMC.

  2. Run the following command to generate the token. Replace <SN40L-2_BMC_IP_Address> and <Password> with the appropriate values:

    $ export token=$(curl -k -H "Content-Type: application/json" -X POST https://<SN40L-2_BMC_IP_Address>/login -d '{"username" : "root", "password" : "<Password>"}' | grep token | awk '{print $2;}' | tr -d '"')
  3. Confirm that a token has been generated for your session:

    $ echo $token
    1h0Dk9xjtjsOtBkMhgIN
  4. To validate that the token works from the client system, run the following curl command. Replace <SN40L-2_BMC_IP_Address> with the correct DataScale SN40L-2 BMC IP address.

    $ curl -k -H "X-Auth-Token: $token" https://<SN40L-2_BMC_IP_Address>/xyz/openbmc_project/
    {
    "data": [
    "/xyz/openbmc_project/Ipmi",
    "/xyz/openbmc_project/certs",
    ...
    "/xyz/openbmc_project/user"
    ],
    "message": "200 OK",
    "status": "ok"
    }

    If the command produces output similar to the example, the token works correctly. You can now use the token with other API calls, for example, to power the DataScale SN40L-2 RDU module on and off.
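The token extraction in step 2 depends on how the login response is split across lines. As an alternative, the following minimal sketch pulls the value out with a single sed expression that works whether the JSON is on one line or several. The function name extract_token is hypothetical, and the sketch assumes that the response contains a "token": "<value>" pair, as the grep in step 2 implies.

```shell
# Hypothetical helper: read the JSON login response on stdin and print the
# value of the "token" field. Assumes the response contains "token": "<value>".
extract_token() {
    sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage from the client system:
#   token=$(curl -k -H "Content-Type: application/json" -X POST \
#       "https://<SN40L-2_BMC_IP_Address>/login" \
#       -d '{"username" : "root", "password" : "<Password>"}' | extract_token)
#   [ -n "$token" ] || echo "no token returned" >&2
```

Checking that the variable is non-empty catches a failed login before the token is used in later calls.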

6.3. Update the BMC and RDU controller (RDU-C) firmware

Updating the DataScale SN40L-2 BMC and RDU controller (RDU-C) firmware consists of several tasks, which must be done in sequence.

6.3.1. Prepare the DataScale SN40L-2 BMC primary partition for update

To prepare the primary partition and download the files, follow these steps:

  1. Shut down the DataScale SN40L-H host module in the system. This ensures that no models or other processes are running. See Gracefully shutting down the DataScale SN40L rack.

  2. Shut down the DataScale SN40L-2 RDU module. See Gracefully shutting down the DataScale SN40L rack.

  3. Log in to the DataScale SN40L-2 BMC and reboot the BMC to clear the BMC registers, as follows:

    $ ssh root@<SN40L-2_BMC_IP_Address>
    Password: <Enter root password>
    
    root@xrdu:~# reboot
  4. Wait until the reboot process completes (3-5 minutes).

  5. Download the DataScale SN40L-2 firmware update file sn<XRDU_version>-xrdu-sys-fw-<fw_version_number>.tar.gz from the SambaNova ext-xrdu-fw repository, under the /latest sub-directory, to a system that has access to the network that the DataScale SN40L-2 BMC is on. For details on accessing these required firmware files, see KB article #1063.

    Ensure that you download the XRDU firmware for the DataScale SN40L and not the firmware for a different DataScale version.

  6. Uncompress the sn<XRDU_version>-xrdu-sys-fw-<fw_version_number>.tar.gz file.

  7. Copy the .mtd and .mtd.md5 firmware files from the obmc/ directory to each of the DataScale SN40L-2 BMCs that are to be updated. Place these files in the /dev/shm/ directory on the SN40L-2 BMC.

    $ scp /<uncompressed directory>/obmc/obmc-<version>* root@<SN40L-2_BMC_IP_Address>:/dev/shm/
    Password: <Enter root password>

    Confirm that the .mtd and .mtd.md5 files have been completely transferred to the BMC’s /dev/shm/ directory.

    Ensure that the files copied over are from the rdu-128 directory and not the rdu-64 directory.

  8. Log in to the DataScale SN40L-2 BMC to which the update files were transferred.

    $ ssh root@<SN40L-2_BMC_IP_Address>
    Password: <Enter root password>

    root@xrdu:~# cd /dev/shm/
  9. Confirm that the following two files are located in this directory:

    • obmc-rdu-<version>.mtd

    • obmc-rdu-<version>.mtd.md5

    root@xrdu:/dev/shm# ls obmc*
    obmc-rdu-<version>.mtd  obmc-rdu-<version>.mtd.md5
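One way to confirm that a firmware file was transferred completely is to compare its MD5 digest with the accompanying .md5 file. The following is a minimal sketch, not a documented SambaNova procedure. The function name verify_md5 is hypothetical, and the sketch assumes that the md5sum utility is available where you run it and that the first whitespace-separated field of the .md5 file is the lowercase hex digest.

```shell
# Hypothetical helper: compare the MD5 digest of a file with the digest
# stored in "<file>.md5". Assumes the first field of the .md5 file is the
# hex digest and that md5sum is installed on this system.
verify_md5() {
    file=$1
    expected=$(awk '{ print $1; exit }' "$file.md5") || return 2
    actual=$(md5sum "$file" | awk '{ print $1 }') || return 2
    if [ "$expected" = "$actual" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# Usage (for example, in the directory that holds the transferred files):
#   verify_md5 obmc-rdu-<version>.mtd
```

If the digests do not match, copy the files again before you start the update.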

6.3.2. Perform the update on the primary partition

After you confirm that the two files are available, perform the update as follows:

  1. Run the update on the obmc-rdu-<version>.mtd firmware file.

    root@xrdu:~# obmcupdate -p primary -t bmc -f /dev/shm/obmc-rdu-<version>.mtd

    Do not run any other commands or disconnect the power supply at this time.

  2. Confirm that the Erasing, Writing, and Verifying stages complete to 100%.

  3. When all stages are completed, reboot the BMC with the new firmware.

    root@xrdu:~# reboot -f
  4. After about 3 to 5 minutes, log in to the DataScale SN40L-2 BMC.

    $ ssh root@<SN40L-2_BMC_IP_Address>
    Password: <Enter root password>

    The update reimages the DataScale SN40L-2 BMC, so its SSH host key will likely have changed. You might be prompted to remove the old host entry from the .ssh/known_hosts file on the client that you previously used to connect to the BMC.

  5. Confirm that the update was applied by comparing the version output with the DataScale SN40L-2 BMC firmware patch that you applied:

    root@xrdu:~# obmcupdate -i
    ***** RDU-C *****
    RDU-C Release Version: <current version>
    RDU-C BuildDate: #.## ####   DesignVer: ##   BoardID: ##.
    ***** BMC *****
    BMC Release Version: <updated version>
    BMC BUILD ID: <updated BMC buildid>
    BMC Flash: Primary
    BMC Flash Size: 128MB
  6. If there are any issues running the update, run the obmcupdate command again.

If the update process continues to fail, contact SambaNova Support.
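The version comparison in step 5 can be scripted by extracting the release version lines from the obmcupdate -i output. This is a minimal sketch: the function name release_version is hypothetical, and it relies only on the "BMC Release Version:" and "RDU-C Release Version:" labels shown in the sample output above.

```shell
# Hypothetical helper: read `obmcupdate -i` output on stdin and print the
# release version for the given component label ("BMC" or "RDU-C").
release_version() {
    sed -n "s/^[[:space:]]*$1 Release Version:[[:space:]]*//p"
}

# Usage from the client system (replace the placeholders):
#   installed=$(ssh root@<SN40L-2_BMC_IP_Address> obmcupdate -i | release_version BMC)
#   [ "$installed" = "<expected version>" ] && echo "BMC update confirmed"
```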

6.3.3. Update the DataScale SN40L-2 BMC secondary/recovery partition

The re-imaging of the BMC removes the obmc-rdu-<version>.mtd and obmc-rdu-<version>.mtd.md5 files from /dev/shm/.

  1. Exit out of the SN40L-2 BMC and log back in to the client system where the BMC firmware files were uncompressed.

  2. Copy the obmc-rdu-<version>.mtd and obmc-rdu-<version>.mtd.md5 firmware files back to the DataScale SN40L-2 BMC’s /dev/shm/ directory.

    $ scp /<uncompressed directory>/obmc/obmc-<version>* root@<SN40L-2_BMC_IP_Address>:/dev/shm/
    Password: <Enter SN40L-2 BMC root password>
  3. Confirm that these two files have been completely transferred to the BMC’s /dev/shm/ directory.

  4. Log back in to the DataScale SN40L-2 BMC that was just updated:

    $ ssh root@<SN40L-2_BMC_IP_Address>
    Password: <Enter root password>
  5. Go to the /dev/shm/ directory on the DataScale SN40L-2 BMC.

    root@xrdu:~# cd /dev/shm/
  6. Confirm that the following two files are located in this directory:

    • obmc-rdu-<version>.mtd

    • obmc-rdu-<version>.mtd.md5

      root@xrdu:/dev/shm# ls obmc*
      obmc-rdu-<version>.mtd  obmc-rdu-<version>.mtd.md5
  7. Run the update on the BMC recovery partition using the obmc-rdu-<version>.mtd firmware file.

    root@xrdu:~# obmcupdate -p recovery -t bmc -f /dev/shm/obmc-rdu-<version>.mtd

    Do not run any other commands or disconnect the power supply at this time.

  8. Confirm that the Erasing, Writing, and Verifying stages complete to 100%.

  9. If there are any issues running the update, run the update command once more. If the update process continues to fail, contact SambaNova Support.

When the update is completed, you can update the DataScale SN40L-2 RDU Controller (RDU-C) primary partition.

6.3.4. Update the DataScale SN40L-2 RDU-C primary partition

After you’ve updated both the primary and secondary partitions of the SN40L-2 BMC, you can update the SN40L-2 RDU-C.

  1. Exit out of the SN40L-2 BMC and log back in to the client system where the BMC and RDU-C firmware files were uncompressed.

  2. Copy the following firmware files to the DataScale SN40L-2 BMC’s /dev/shm/ directory:

    • rduc-<version>-primary.spi

    • rduc-<version>-primary.spi.md5

    • rduc-<version>-recovery.spi

    • rduc-<version>-recovery.spi.md5

      $ scp /<uncompressed directory>/rduc/rduc-<version>-* root@<SN40L-2_BMC_IP_Address>:/dev/shm/
      Password: <Enter SN40L-2 BMC root password>
  3. Log in to the DataScale SN40L-2 BMC to which the update files were transferred.

    $ ssh root@<SN40L-2_BMC_IP_Address>
    Password: <Enter root password>
  4. Go to the /dev/shm/ directory on the DataScale SN40L-2 BMC.

    root@xrdu:~# cd /dev/shm/
  5. Confirm that the following files are located in this directory:

    • rduc-<version>-primary.spi

    • rduc-<version>-primary.spi.md5

    • rduc-<version>-recovery.spi

    • rduc-<version>-recovery.spi.md5

      root@xrdu:/dev/shm# ls rduc*
      rduc-<version>-primary.spi  rduc-<version>-primary.spi.md5  rduc-<version>-recovery.spi
      rduc-<version>-recovery.spi.md5
  6. Run the update using the primary.spi firmware file to update the DataScale SN40L-2 RDU-C primary partition.

    root@xrdu:/dev/shm# obmcupdate -p primary -t rduc -f /dev/shm/rduc-<version>-primary.spi

    Do not run any other commands or disconnect the power supply at this time.

  7. Confirm that the update of the RDU-C has taken effect by running the obmcupdate -i command.

    root@xrdu:~# obmcupdate -i
    ***** RDU-C *****
    RDU-C Release Version: <updated version>
    RDU-C BuildDate: #.## ####   DesignVer: ##   BoardID: ##
    ***** BMC *****
    BMC Release Version: <updated version>
    BMC BUILD ID: <updated build id>
    BMC Flash: Primary
    BMC Flash Size: 128MB

    Verify that the RDU-C Release Version appears as the updated version.

6.3.5. Update the DataScale SN40L-2 RDU-C secondary/recovery partition

  1. To update the DataScale SN40L-2 RDU-C recovery partition, run the obmcupdate command with the rduc-<version>-recovery.spi firmware file.

    root@xrdu:/dev/shm# obmcupdate -p recovery -t rduc -f /dev/shm/rduc-<version>-recovery.spi
  2. If any issues occur during the update of the DataScale SN40L-2 BMC or RDU-C, contact SambaNova Support.

After the DataScale SN40L-2 BMC and RDU-C have successfully been updated, it is safe to power on the DataScale SN40L-2 and then the SN40L-H modules. See the Power on the DataScale SN40L rack procedure.
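Because the four updates must run in a fixed order, it can help to print the command sequence for your firmware versions as a checklist before you start. The following minimal sketch only prints text and does not run anything. The function name print_update_plan is hypothetical, and the file names follow the obmc-rdu-<version> and rduc-<version> patterns used in this section.

```shell
# Hypothetical helper: print the firmware update commands in the order
# described in sections 6.3.2 through 6.3.5. Nothing is executed.
#   $1 = BMC firmware version string, $2 = RDU-C firmware version string
print_update_plan() {
    bmc_ver=$1
    rduc_ver=$2
    cat <<EOF
obmcupdate -p primary -t bmc -f /dev/shm/obmc-rdu-${bmc_ver}.mtd
reboot -f
# copy the obmc files to /dev/shm/ again (reimaging removes them)
obmcupdate -p recovery -t bmc -f /dev/shm/obmc-rdu-${bmc_ver}.mtd
obmcupdate -p primary -t rduc -f /dev/shm/rduc-${rduc_ver}-primary.spi
obmcupdate -p recovery -t rduc -f /dev/shm/rduc-${rduc_ver}-recovery.spi
EOF
}

# Usage:
#   print_update_plan <bmc_version> <rduc_version>
```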

6.4. Configure the DataScale SN40L-2 BMC network

When you change the IP address of a DataScale SN40L-2 BMC, you must also update the IP_ADDRESS_SP# entries in the /platform/network.json file on that BMC and on the other DataScale SN40L-2 BMCs in the node that are directly connected to it.
After you change the IP address and restart the network service, existing ssh sessions are terminated or left in a hung state because the IP address has changed. Log in to the DataScale SN40L-2 BMC again using the new IP address.

DataScale SN40L-2 BMC networking is configured as part of the DataScale SN40L rack delivery. It’s not usually necessary to modify the network configuration upon delivery, although there might be situations where the network has to be reconfigured later.

You can change the network settings by running the network-settings command, as shown below. Table 5 describes the command options.

root@xrdu:~# network-settings [-h] -i [IPADDRESS] -n [NETMASK] -g [GATEWAY] -d [DNS] [{static,DHCP}]

Table 5. Command options for network-settings

Option | Function
{static,DHCP} | Specify the network mode.
-h, --help | Show the help message and exit.
-i [IPADDRESS], --ipAddress [IPADDRESS] | IP address for static connection. Example: "10.10.0.0". Use "" for DHCP.
-n [NETMASK], --netMask [NETMASK] | Netmask number for static network mode (between 0 and 32). Use any number for DHCP.
-g [GATEWAY], --gateWay [GATEWAY] | Gateway for static connection. Example: "10.10.0.0". Use "" for DHCP.
-d [DNS], --dns [DNS] | DNS for static connection. Example: "10.10.0.0". Use "" for DHCP.

  1. Set the IP address configuration using the network-settings command.

    Example 1: Set a static IP address of 10.10.0.15 on a /24 subnet with gateway address 10.10.0.1 and a DNS server on 10.0.0.13:

    root@xrdu:~# network-settings -i "10.10.0.15" -n 24 -g "10.10.0.1" -d "10.0.0.13" static
    Modifiying network settings ...
    Toggling network settings ...

    Example 2: Set the network mode to DHCP:

    root@xrdu:~# network-settings -i "" -n 0 -g "" -d "" DHCP
    Modifiying network settings ...
    Toggling network settings ...
  2. After you successfully run the command, restart the network service to ensure that the configuration is set and running:

    root@xrdu:~# systemctl restart systemd-networkd.service

    At this point, the current ssh session should have been terminated or be in a hung state.

  3. Open a new terminal and log in to the DataScale SN40L-2 BMC:

    $ ssh root@<SN40L-2_New_BMC_IP_Address>
    Password: <Enter root password>
  4. To confirm the IP address configuration, run the ip address command. In the command output, the assigned IP address appears as the second inet value under eth0.

    root@xrdu:~# ip address
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever
    2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether XX:XX:XX:XX:XX:XX brd ff:ff:ff:ff:ff:ff
    inet 169.254.192.89/16 brd 169.254.255.255 scope link eth0
    valid_lft forever preferred_lft forever
    inet 10.10.0.15 brd 10.10.0.255 scope global dynamic eth0
    valid_lft 40746sec preferred_lft 40746sec
    inet6...
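Because a wrong address can leave the BMC unreachable, it can be worth validating the values before you run network-settings with the static mode. The following is a minimal sketch with hypothetical function names (valid_ipv4, valid_prefix, static_settings_cmd). It checks only the format of the values, not whether the addresses are correct for your network, and it prints the command instead of running it.

```shell
# Hypothetical helper: succeed if $1 is four dot-separated decimal octets (0-255).
valid_ipv4() {
    printf '%s\n' "$1" | awk -F. '
        NF != 4 { exit 1 }
        {
            for (i = 1; i <= 4; i++)
                if ($i !~ /^[0-9]+$/ || $i + 0 > 255) exit 1
        }'
}

# Hypothetical helper: succeed if $1 is an integer between 0 and 32.
valid_prefix() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 0 ] && [ "$1" -le 32 ]
}

# Print the static network-settings command if all values are well formed.
#   $1 = IP address, $2 = netmask number, $3 = gateway, $4 = DNS
static_settings_cmd() {
    valid_ipv4 "$1" && valid_prefix "$2" && valid_ipv4 "$3" && valid_ipv4 "$4" || return 1
    printf 'network-settings -i "%s" -n %s -g "%s" -d "%s" static\n' "$1" "$2" "$3" "$4"
}

# Usage:
#   static_settings_cmd 10.10.0.15 24 10.10.0.1 10.0.0.13
```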

6.5. Configure the DataScale SN40L-2 hostname

To configure or modify the DataScale SN40L-2 hostname, follow these steps:

  1. Log in to the DataScale SN40L-2 BMC:

    $ ssh root@<SN40L-2_BMC_IP_Address>
    Password: <Enter root password>
  2. Run the following command to configure or modify the DataScale SN40L-2 hostname:

    root@xrdu:~# hostnamectl set-hostname <hostname>
  3. To see the new hostname, log out and log back in to the DataScale SN40L-2 BMC.

7. Monitor and debug the DataScale SN40L rack

The DataScale® SN40L rack supports standard methods to monitor and triage the system. This page includes some tasks you can perform, such as examining log files, and also explains how to collect diagnostic information for use with SambaNova support.

7.1. Overview of tools and logs

Several tools and logs can help you resolve problems. Here’s an overview:

Table 6. Monitoring and debugging tools

Task | Tool | See
Check the status of the DataScale SN40L-2 RDU module. | xrdutool | View xrdutool diagnostics and logs
Configure SNMP alerts for third-party rack components. | SNMP alerts | Set up SNMP alerts
Diagnose problems with logs. | OS logs, BMC logs, compiler logs, application logs | Viewing system logs
Check and manage SND, view SND logs. | SND (SambaNova Daemon) | SambaNova daemon (SND) diagnostics
Debug model compilation, running models, and third-party components. | Misc. tools and logs | Debugging DataScale SN40L issues

If you cannot resolve the issues yourself, create a support case and include diagnostic materials. See View SN40L-H BMC diagnostic information and logs.

7.2. View xrdutool diagnostics and logs

You use the xrdutool tool and logs to diagnose a DataScale SN40L-2 issue and to collect information for SambaNova Support to triage an issue. The tool gets the status of the DataScale SN40L-2 RDU module that the tool is run on.

Use the tool to check the overall status of the DataScale SN40L-2 RDU module and of the hosted RDUs and memory. Follow these steps to examine the output on the power and fault status of the DataScale SN40L-2 board:

  1. Log in to the BMC of the DataScale SN40L-2 RDU module that is having problems:

    $ ssh root@<BMC_IP_Address>
    Password: <Enter root password>
  2. Run the xrdutool command:

    root@xrdu:~# xrdutool status
  3. Examine the output, which gives a quick view into the state of the DataScale SN40L-2 RDU module along with two RDUs and the RDU controller. The output:

    • Shows whether any faults have been detected.

    • Shows the power state of the DataScale SN40L-2 RDU module and of the RDU.

Here’s an example:

Power is on
RDU-C Release Version: 4.4.0
RDU-C BuildDate: 10.17 1654   DesignVer: 69   BoardID: 60
XRDU_0: STATUS
--------------------------------------------------------
SYSTEM :  rdu3    rdu2    rdu1    rdu0    stby    ps      pex0    pex1    sys     p3v3        mss_op_state   mss_log_level
           1       1       1       1       1       1       1       1       1       1               4               1
--------------------------------------------------------
RDU_0/D_0  0935a00001f1d6a4 102007b367359895     RDU_0/D_1  09a6c000012eda24 605007b367359895     ON. Please verify rdu_pwr_status[0] value to determine faults
--------------------------------------------------------
ENABLES:  vddo    pvpp            pvdd    pvddq           pvtt            pavddh  pavdd   vddc
           1       1               1       1               1               1       1       1
PWRGOOD:  vddo    pvpp0   pvpp1   pvdd0    pvdd1  pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1   vddc2   vddc3
           1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
--------------------------------------------------------
RDU_1/D_0  09e9a00001a5dc64 502807b367359895     RDU_1/D_1  08e8200000bedd24 107007b367359895     ON. Please verify rdu_pwr_status[1] value to determine faults
--------------------------------------------------------
ENABLES:  vddo    pvpp            pvdd    pvddq           pvtt            pavddh  pavdd   vddc
           1       1               1       1               1               1       1       1
PWRGOOD:  vddo    pvpp0   pvpp1   pvdd0    pvdd1  pvddq0  pvddq1  pvtt0   pvtt1   pavddh  pavdd   vddc0   vddc1   vddc2   vddc3
           1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
--------------------------------------------------------
PEX_0:   fpga_p0v8_pex_pgd2   pg_p1v25_pex   pg_p1v8_pex_pll   fpga_pg_p1v8_pex
               1               1               1               1
--------------------------------------------------------
PEX_1:   fpga_p0v8_pex_pgd2   pg_p1v25_pex   pg_p1v8_pex_pll   fpga_pg_p1v8_pex
               1               1               1               1
--------------------------------------------------------
rduc_pwr_status[0] = 0x7fff
rduc_pwr_status[1] = 0x7fff
pex_pwr_status[0] = 0x7f
pex_pwr_status[1] = 0x7f
power_status_aggregate = 0x7fff
Board Type: 3
NUM_RDUS: 2
NUM_DIE_PER_RDU: 2
NUM_DIES: 4

In addition to collecting diagnostic information from the SN40L-2 RDU module directly, you can get health status of all the SN40L-2 RDU modules in the SN40L-8 node by using the SambaNova Fault Management (SNFM) utility that comes pre-installed on the host. See the SambaNova Fault Management (SNFM) utility documentation.

For details on diagnosing a DataScale SN40L-2 RDU module’s BMC and on collecting the required diagnostic and log material, see KB article #1024 in the SambaNova Support portal.
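When you compare the state of a module before and after an event, or attach it to a support case, it can be convenient to reduce the xrdutool status output to its status register lines. The following minimal sketch extracts every "name = 0x..." line from the output. The function name status_registers is hypothetical, and the sketch does not interpret the values. Which values indicate a fault is not covered here.

```shell
# Hypothetical helper: read `xrdutool status` output on stdin and print each
# "name = 0x<hex>" status line as "name value". Values are not interpreted.
status_registers() {
    awk '$2 == "=" && $3 ~ /^0x[0-9a-fA-F]+$/ { print $1, $3 }'
}

# Usage from the client system (replace the placeholder):
#   ssh root@<BMC_IP_Address> xrdutool status | status_registers > status_before.txt
#   ... later ...
#   ssh root@<BMC_IP_Address> xrdutool status | status_registers | diff status_before.txt -
```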

7.3. Set up SNMP alerts

To configure SNMP alerts for non-SambaNova components in the DataScale SN40L rack, see the vendor-specific documentation.

7.4. Viewing system logs

You can use the following log files to identify and resolve issues with the system or an application:

  • OS logs

  • BMC logs

  • SambaNova compiler logs

  • Application logs

7.4.1. OS logs

SambaNova does not alter the logs or log directories for Red Hat Enterprise Linux or Ubuntu. Most logs are in the /var/log/ directory, and you can also use standard log tools such as journalctl.

7.4.3. SambaNova compiler logs

Additional compiler logs are written to the output directory that was specified when the models were compiled. These logs are fairly low level, and SambaNova Support requests them to troubleshoot issues. For details, see Collect diagnostic materials for SambaNova Support.

You can use different compiler log verbosity settings to debug issues. See Troubleshooting Runtime.

7.4.4. Runtime logs

The following log files related to SambaNova are in the /var/log/sambaflow/runtime/ directory:

sn.log

Logs related to SambaNova graph operations. Events received by the graph process and graph-specific events (including errors) that are not logged to snd.log.

snd.log

SambaNova daemon (SND) system logs. Summary of RDU resources and hardware error events.

Additional log events such as kernel logs (from the RDU driver module) go to dmesg(1).

You can use different log verbosity settings to get more logging details for the SambaNova Runtime and other components of the software stack. See Troubleshooting Runtime.

7.5. SambaNova daemon (SND) diagnostics

The SambaNova daemon (SND) runs on the DataScale SN40L-H host module and manages several critical pieces of the SambaNova operation. The SND is responsible for:

  • Loading and unloading the RDU drivers

  • Initializing RDU system resources

  • Managing hardware faults for the RDU system

  • Enabling the debugging of the RDU system’s hardware resources

The SND is required to run graphs and models because:

  • The SND handles the RDU drivers and the initialization of RDU resources.

  • The SND is aware of issues with RDU resources and can avoid problematic resources.

The SND starts automatically:

  • At boot time of the DataScale SN40L-H OS, when it begins discovering and initializing the RDUs. This is why it is important to power on the DataScale SN40L-2 RDU modules before you power on the SN40L-H host module.

  • When the SambaFlow package is installed. In this case, the SND waits a few minutes after the installation for the RDU system discovery and initialization processes to complete.

7.5.1. Check SND status

To check the status of the SND, run the systemctl status snd command. Below is sample output showing what the command might return:

$ sudo systemctl status snd
● snd.service - SN Devices Service
     Loaded: loaded (/lib/systemd/system/snd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/snd.service.d
             └─override.conf
     Active: active (running) since Wed 2022-10-19 07:10:10 PDT; 3h 24min ago
   Main PID: 5263 (snd)
      Tasks: 10 (limit: 629145)
     Memory: 164.9M
     CGroup: /system.slice/snd.service
             └─5263 /opt/sambaflow/bin/snd
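For scripted health checks, the unit state can be read without parsing the human-readable output above. The following is a minimal sketch, assuming a Bash shell on the host module. The function name snd_is_running and the SYSTEMCTL override variable are illustrative and are not part of SambaFlow; systemctl is-active is a standard systemd subcommand that prints the unit state and returns 0 only when the unit is active.

```shell
#!/usr/bin/env bash
# Sketch: report whether the snd unit is active.
# SYSTEMCTL is a hypothetical override so the function can be exercised
# on a machine without systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

snd_is_running() {
    local state
    # is-active prints the unit state (for example: active, inactive, failed)
    # and returns a nonzero exit status unless the unit is active.
    state="$("$SYSTEMCTL" is-active snd 2>/dev/null)"
    if [ "$state" = "active" ]; then
        echo "snd is running"
        return 0
    fi
    echo "snd is not running (state: ${state:-unknown})"
    return 1
}
```

After sourcing the sketch, call snd_is_running and test its exit status. An active unit only shows that the daemon process is up; it does not by itself confirm that RDU discovery and initialization have completed.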

7.5.2. Start, stop, and restart SND

You can start, stop, and restart the SND with the following commands:

To start the SND:

$ sudo systemctl start snd

To stop the SND:

$ sudo systemctl stop snd

To restart the SND:

$ sudo systemctl restart snd
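When a restart is part of a script, it can help to wait until systemd reports the unit as active before continuing. The following is a sketch under stated assumptions: the function name restart_snd_and_wait, the SYSTEMCTL override variable, and the default timeout and polling interval are all illustrative values, not SambaNova recommendations.

```shell
#!/usr/bin/env bash
# Sketch: restart snd, then poll until systemd reports the unit as active.
# SYSTEMCTL is a hypothetical override; it is expanded unquoted on purpose
# so that the default "sudo systemctl" splits into two words.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"

restart_snd_and_wait() {
    # $1: timeout in seconds (default 60); $2: polling interval (default 5).
    local timeout="${1:-60}" interval="${2:-5}" waited=0
    $SYSTEMCTL restart snd || return 1
    while [ "$waited" -lt "$timeout" ]; do
        # --quiet suppresses the state text; only the exit status is used.
        if $SYSTEMCTL is-active --quiet snd; then
            echo "snd active after ${waited}s"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "snd not active after ${timeout}s" >&2
    return 1
}
```

For example, restart_snd_and_wait 120 10 polls every 10 seconds for up to 2 minutes. Use a polling interval of at least 1 second so that the loop can reach the timeout.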

7.5.3. Use SND for debugging

The SND CLI provides physical visibility into the entire DataScale SN40L-8 system. This allows complete access to the RDU system for debugging, triage, and validation efforts.

The SND also responds to error events that occur on the RDU and on the entire DataScale SN40L-2 RDU module.

All logs from the SND are written to /var/log/sambaflow/runtime/snd.log. This log provides a summary of the RDU resources available to the system and includes any hardware error events that occur. The information is useful for diagnosing and resolving hardware issues.
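As a first triage step, the log can be filtered for recent error events. The following is a minimal sketch. It assumes that error events contain the word "error" in any letter case, which might not match the actual log format, so adjust the pattern after inspecting the file. The function name recent_snd_errors and the SND_LOG override variable are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: print the most recent lines in snd.log that mention an error.
# Assumption: error events contain the word "error" (any case). The real
# log format may differ; adjust the grep pattern as needed.
SND_LOG="${SND_LOG:-/var/log/sambaflow/runtime/snd.log}"

recent_snd_errors() {
    # $1: maximum number of matching lines to print (default 20).
    local count="${1:-20}"
    if [ ! -r "$SND_LOG" ]; then
        echo "cannot read $SND_LOG" >&2
        return 1
    fi
    grep -i 'error' "$SND_LOG" | tail -n "$count"
}
```

Depending on the file permissions at your site, reading the log might require elevated privileges.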

7.6. Debugging DataScale SN40L issues

Troubleshooting might require that you debug issues in the following areas of the DataScale SN40L rack:

  • Compilation of models

  • Running of models

  • Third-party components

7.6.1. Debug model compilation

For problems that occur while compiling models, run the following command and examine the logs that are generated in the user-specified output directory:

$ python <model_script.py> compile --output-folder=<output_directory>

You can set different levels of logging verbosity when you compile a model. See Collect diagnostic materials for SambaNova Support for best practices when creating a support case.
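When a compilation problem has to be reported, it can be convenient to keep the console output together with the generated logs. The following is a sketch, not a SambaNova-provided tool: the function name compile_with_log, the PYTHON override variable, and the console.log file name are illustrative. The compile arguments are the ones from the command in this section.

```shell
#!/usr/bin/env bash
# Sketch: compile a model and save the console output in the output directory.
# compile_with_log, PYTHON, and console.log are hypothetical names.
PYTHON="${PYTHON:-python}"

compile_with_log() {
    # $1: model script; $2: output directory for the compile logs.
    local script="$1" outdir="$2"
    mkdir -p "$outdir" || return 1
    # 2>&1 merges stderr into stdout so that tee records both streams.
    "$PYTHON" "$script" compile --output-folder="$outdir" 2>&1 | tee "$outdir/console.log"
    # Return the exit status of the compile command, not of tee.
    return "${PIPESTATUS[0]}"
}
```

For example, compile_with_log <model_script.py> <output_directory> runs the compile and leaves console.log next to the other files in the output directory.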

7.6.2. Debug running models

For problems that occur while running models, use these resources:

  • The /var/log/sambaflow/runtime/ log files

    These logs provide an initial glance into an issue that is occurring while running a model. If a problem does occur and is reproducible, enable more logging verbosity for SambaFlow Runtime. See the "Changing Runtime Log Levels" section of the SambaNova Runtime Guide for details.

  • The SambaNova Fault Management (SNFM) tool

    The SNFM tool provides a framework to:

      • Monitor, log, and clear various faults associated with a DataScale SN40L-2 RDU module

      • Provide corrective actions to recover from these faults

    This capability is built into the SambaNova daemon (SND) and installed as part of SambaFlow. See "SambaNova Fault Management (SNFM) User" in the SambaNova Runtime Guide for details.
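To decide which of the /var/log/sambaflow/runtime/ log files to open first, it can help to list the most recently modified ones. The following is a minimal sketch; the function name list_recent_runtime_logs is illustrative, and the directory is the one named in this section.

```shell
#!/usr/bin/env bash
# Sketch: list the most recently modified files in the runtime log directory.
# list_recent_runtime_logs is a hypothetical helper name.
list_recent_runtime_logs() {
    # $1: log directory (default: the runtime log directory);
    # $2: number of file names to print (default 5).
    local dir="${1:-/var/log/sambaflow/runtime}" count="${2:-5}"
    # ls -t sorts by modification time, newest first.
    ls -t "$dir" | head -n "$count"
}
```

The files listed first are the ones written to most recently, which is usually where a reproducible problem leaves its traces.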

7.6.3. Debug third-party components

For operational issues with the third-party components in the DataScale SN40L rack, see the vendor-specific documentation. For issues that require additional support or for questions related to troubleshooting, open a support case through SambaNova Support. See KB article #1017, "SambaNova Systems Support Best Practices," at https://support.sambanova.ai.

Do not open a case directly with the product vendor.

7.7. Collect diagnostic materials for SambaNova Support

When you open a support case, provide details on the issue that has occurred, along with initial diagnostic materials. For instructions on collecting diagnostic materials, see the following KB articles in the SambaNova Support portal.

Only SambaNova customers with a valid support contract can access the portal.

  • DataScale SN40L-2 Diagnostic Collection: KB article #1024

  • DataScale SN40L-H BMC Diagnostic Collection: KB article #1039

  • DataScale SN40L-H (Red Hat Enterprise Linux) Diagnostic Collection: KB article #1039

  • DataScale SN40L-H (Ubuntu) Diagnostic Collection: KB article #1039

  • Ethernet Data Switch Diagnostic Collection: KB article #1053

  • Access Switch Diagnostic Collection: KB article #1053

  • Serial Console Server Diagnostic Collection: KB article #1121

  • PDU Diagnostic Collection: KB article #1120

7.8. View SN40L-H BMC diagnostic information and logs

To quickly identify a system’s status and view diagnostic information and logs for the DataScale SN40L-H BMC, follow these steps:

  1. Log in to the BMC’s Web UI and view the BMC dashboard.

    Diagnostic information

  2. For details on logs and pending events/deassertions, click the More info link in each box.

  3. To find more logs and reports, click Logs & Reports in the left pane and select a log.

    Logs & Reports item

See KB article #1039, “Diagnostic Data Collection Tool (samba_diag),” in the SambaNova Support portal (https://support.sambanova.ai) for details on:

  • Diagnosing a DataScale SN40L-H host module’s BMC

  • Diagnosing the DataScale SN40L-H host module in general

  • Collecting the required diagnostic materials and logs

8. Back up and restore components

Use your site-specific guidelines and tools for backing up and restoring components of the DataScale® SN40L rack.

If you change the standard configuration of the networking equipment that is shipped to you, save the configuration changes you make to the devices. For details, see the SambaNova Day 1 Document and the KB articles listed below. You can find KB articles in the SambaNova Support portal at https://support.sambanova.ai.

Only SambaNova customers can access the support portal and view the KB articles.

8.1. Recover the Juniper access and data switch

For the process to recover the Juniper access switch and data switch, see the following KB articles:

  • Juniper Switch Password Recovery: KB article #1056

  • Juniper Switch Factory Reset Recovery: KB article #1056

  • Juniper Switch Saving Running Configuration: KB article #1056

8.2. Recover the Lantronix serial console server

For the process to recover the Lantronix serial console server, including recovering the sysadmin password, see the following KB articles:

  • Lantronix Serial Console Server Password Recovery: KB article #1059

  • Lantronix Serial Console Server Factory Reset Recovery: KB article #1059

  • Lantronix Serial Console Server Saving Running Configuration: KB article #1059

8.3. Recover the DataScale SN40L-H host module

If the DataScale SN40L-H OS needs to be recovered and the SN40L-H host boot partitions are not damaged, contact SambaNova Support. Recovering the SN40L-H OS to the factory baseline might be possible, and it can be a faster option than using the recovery ISOs.

For the processes to recover the DataScale SN40L-H host module, see the following KB articles:

  • DataScale SN40L-H OS Recovery Using the Recovery ISO – Ubuntu: KB article #1051

  • DataScale SN40L-H OS Recovery Using the Recovery ISO – Red Hat: KB article #1099

  • DataScale SN40L-H BMC Password Recovery: KB article #1021

  • DataScale SN40L-H BMC Non-Corruption Recovery: KB article #1038

8.4. Recover the DataScale SN40L-2 RDU module

For the process to recover the DataScale SN40L-2 RDU module, refer to the following KB article:

  • SambaNova DataScale SN40L-2 BMC Password Recovery: KB article #1049

8.5. Upload recovery configuration files

For the process to upload configuration files used as part of the recovery process for some of these components, see the following KB articles:

  • Uploading Configuration Files for Recovery: KB article #1055

  • Listing and Downloading Configuration Files for Recovery: KB article #1044

For questions concerning any of these recovery KB articles or for anything that is not covered here, open a support case through the SambaNova Support portal (https://support.sambanova.ai).