IPMI

From Alteeve Wiki
Revision as of 22:55, 11 August 2023 by Digimer

IPMI is an acronym for Intelligent Platform Management Interface. It is a technology built into many server-grade mainboards and implemented by a dedicated controller called the Baseboard Management Controller, or BMC. IPMI, via the BMC, allows "out of band" access to a server. This means that, via an IPMI interface, a user can remotely connect to a server regardless of its power state and read its sensor data.

The BMC is isolated from the host machine, and so it can be used to power on a machine that is off, force a crashed machine to reboot, etc.

Company-Specific Implementations of IPMI

Many companies build on the basic IPMI standard by adding advanced features, such as remote console access over the network and the ability to monitor devices in the server like the RAID controller and its hard drives. Each vendor generally has its own name for its implementation of IPMI:

  • Dell calls theirs DRAC
  • Fujitsu calls theirs iRMC
  • HP calls theirs iLO
  • IBM calls theirs RSA

Various other vendors will have different names as well. In most cases though, they will all support the generic IPMI interface, plus additional features beyond the IPMI specification.

Fencing

In an Anvil! cluster, IPMI is used to force a node that has entered an unknown state into a known state. Specifically, when a node stops responding, it is forced to power off via IPMI. This mechanism is called fencing, and it is core to the Anvil! cluster's reliability. IPMI is not the only way to fence a node, but when it is available, it is always the primary way to fence.
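To make "fence by powering off" concrete, here is a minimal sketch of the ipmitool calls involved. It assumes the BMC is reachable over the network and that the password is exported in the IPMI_PASSWORD environment variable, which ipmitool reads when given -E. The function name fence_power_off and its retry count are hypothetical. A production cluster should rely on a proper fence agent, for example fence_ipmilan from the fence-agents project, not a script like this.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of IPMI fencing: power a node off, then confirm it is off.
# Assumes: ipmitool is installed, the BMC is reachable over the network, and the
# BMC password is exported in IPMI_PASSWORD (ipmitool's -E flag reads it).

fence_power_off() {
    local host=$1 user=$2 tries=${3:-10} i

    # Ask the BMC to cut power immediately; this is not a clean OS shutdown.
    ipmitool -I lanplus -H "$host" -U "$user" -E chassis power off >/dev/null || return 1

    # The fence only counts once the BMC confirms the node is really off.
    for (( i = 0; i < tries; i++ )); do
        if ipmitool -I lanplus -H "$host" -U "$user" -E chassis power status \
            | grep -q 'Chassis Power is off'; then
            return 0
        fi
        sleep 1
    done
    return 1    # could not confirm power-off; treat the fence as failed
}
```

For example: export IPMI_PASSWORD='supersecretpassw'; fence_power_off 10.201.11.3 admin && echo fenced. Checking the power status afterwards is the important design choice: a fence that cannot be confirmed must be treated as a failure.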

Configuring IPMI

Note: In M3 Anvil! clusters, you do not need to configure IPMI manually. The Anvil! will detect and configure most IPMI BMCs automatically. If you find that this is not the case, please contact us so that we can add support for your hardware.

If you want or need to configure IPMI manually, here is how to do it from the Linux command line.

Finding the BMC

We need to assign an IP address to each IPMI BMC and then configure the user name and password to use later when connecting.

We will also use the sensor values reported by the IPMI BMC in our monitoring and alert system. If, for example, a temperature climbs too high or too fast, the alert system will be able to see this and fire off an alert.

Note: This section walks through configuring IPMI on an-a01dr01 only, which is a Fujitsu Primergy server. The specifics for different brands will vary, but the guide should apply similarly.

Note: This tutorial requires that the freeipmi and ipmitool packages are installed. If any of the Anvil! RPMs are installed, these packages will already be present.

Check to see if your machine has an IPMI BMC:

an-a01dr01
dmidecode --type 38
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.

Handle 0x0020, DMI type 38, 18 bytes
IPMI Device Information
	Interface Type: KCS (Keyboard Control Style)
	Specification Version: 2.0
	I2C Slave Address: 0x10
	NV Storage Device: Not Present
	Base Address: 0x0000000000000CA2 (I/O)
	Register Spacing: Successive Byte Boundaries

The IPMI Device Information line shows us that there is indeed an IPMI BMC on this machine. Knowing this, we should be able to see that there is a /dev/ipmi0 device available.

an-a01dr01
ls -lah /dev/ipmi0
crw-------. 1 root root 242, 0 Aug 11 15:44 /dev/ipmi0

Looking good!

Reading IPMI Data

The presence of /dev/ipmi0 tells us that we should be able to talk to the BMC. If it does not exist on your machine, find out what make and model of IPMI BMC your server uses and look for known issues with that chip.
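If /dev/ipmi0 is missing, one common cause is that the kernel drivers are not loaded. The helpers below are a minimal sketch with hypothetical function names. They wrap the dmidecode check from above and assume the stock Linux IPMI drivers. Before giving up, they try loading two modules:

  • ipmi_si, the system interface driver that handles KCS, SMIC and BT interfaces, including the KCS interface reported above.
  • ipmi_devintf, which provides the /dev/ipmiN device nodes.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for confirming a local BMC is usable.
# Assumes root privileges and the standard Linux IPMI kernel modules.

# Succeeds if `dmidecode --type 38` output (read on stdin) lists an IPMI device.
dmi_has_ipmi() {
    grep -q '^IPMI Device Information'
}

# Make sure the IPMI character device exists, loading drivers if needed.
ensure_ipmi_dev() {
    local dev=${1:-/dev/ipmi0}
    if [ ! -c "$dev" ]; then
        modprobe ipmi_si 2>/dev/null       # system interface driver (KCS/SMIC/BT)
        modprobe ipmi_devintf 2>/dev/null  # creates the /dev/ipmiN device nodes
    fi
    [ -c "$dev" ]
}
```

Run this as root on the host itself, for example: dmidecode --type 38 | dmi_has_ipmi && ensure_ipmi_dev && echo "BMC device ready".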

The first thing we'll check is that we can query IPMI's chassis data:

an-a01dr01
ipmitool chassis status
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     : ac-failed 
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
Front Panel Control  : none

Excellent! If you get something like this, you're past 90% of the potential problems.
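When scripting against the BMC, it can be handy to pull a single field out of this output. Here is a minimal sketch. It assumes the "Field name : value" layout shown above, which other ipmitool versions or BMCs may word differently. The name chassis_field is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one field from `ipmitool chassis status`.
# Assumes the "Field name    : value" layout shown above.
chassis_field() {
    # Split on the colon plus any padding around it; print the first match only.
    awk -F' *: *' -v key="$1" '$1 == key { print $2; exit }'
}
```

For the output above, ipmitool chassis status | chassis_field 'System Power' prints on.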

We can get more information about the BMC itself by using mc to query the management controller.

an-a01dr01
ipmitool mc info
Device ID                 : 4
Device Revision           : 2
Firmware Revision         : 1.00
IPMI Version              : 2.0
Manufacturer ID           : 10368
Manufacturer Name         : Fujitsu Siemens
Product ID                : 1041 (0x0411)
Product Name              : Unknown (0x411)
Device Available          : yes
Provides Device SDRs      : no
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device
Aux Firmware Rev Info     : 
    0x09
    0x44
    0x00
    0x46

We can also look at the FRU data. FRUs, or "field-replaceable units", are components that can be swapped out as needed. Every server will report different data here.

an-a01dr01
ipmitool fru print
FRU Device Description : Builtin FRU Device (ID 0)
 Board Mfg Date        : Sun Aug 16 23:24:00 2015
 Board Mfg             : FUJITSU
 Board Product         : D3229
 Board Serial          : 47980304
 Board Part Number     : S26361-D3229-A15
 Board Extra           : WGS06 GS02
 Board Extra           : 02
 Product Manufacturer  : FUJITSU
 Product Name          : PRIMERGY RX1330 M1
 Product Part Number   : ABN:K1537-V401-908
 Product Version       : GS01
 Product Serial        : YLWT010802
 Product Asset Tag     : 15
 Product Extra         : d10a9e
 Product Extra         : 0411
 Product Extra         : CS0f

FRU Device Description : Chassis (ID 2)
 Chassis Type          : Rack Mount Chassis
 Chassis Extra         : RX1330M1R7
 Product Manufacturer  : FUJITSU
 Product Name          : PRIMERGY RX1330 M1
 Product Part Number   : ABN:K1537-V401-908
 Product Version       : GS01
 Product Serial        : YLWT010802
 Product Asset Tag     : 15
 Product Extra         : d10a9e
 Product Extra         : 0411
 Product Extra         : CS0f

FRU Device Description : MainBoard (ID 3)
 Board Mfg Date        : Sun Aug 16 23:24:00 2015
 Board Mfg             : FUJITSU
 Board Product         : D3229
 Board Serial          : 47980304
 Board Part Number     : S26361-D3229-A15
 Board Extra           : WGS06 GS02
 Board Extra           : 02

FRU Device Description : PSU STD (ID 11)
 Device not present (Command response could not be provided)

FRU Device Description : PSU1 (ID 12)
 Board Mfg Date        : Sat May 30 19:29:00 2015
 Board Mfg             : DELTA
 Board Product         : DPS-450SB A  
 Board Serial          : DCND1522062427
 Board Part Number     : A3C40161429
 Board Extra           : S9F
 Board Extra           : 08

FRU Device Description : PSU2 (ID 13)
 Board Mfg Date        : Sat May 30 18:36:00 2015
 Board Mfg             : DELTA
 Board Product         : DPS-450SB A  
 Board Serial          : DCND1522062468
 Board Part Number     : A3C40161429
 Board Extra           : S9F
 Board Extra           : 08

We can check all the sensor values using ipmitool as well. This is actually what the cluster monitor we'll install later does.

an-a01dr01
ipmitool sdr list
Ambient          | 26 degrees C      | ok
Systemboard      | 39 degrees C      | ok
CPU              | 47 degrees C      | ok
MEM A            | 34 degrees C      | ok
MEM B            | 33 degrees C      | ok
PSU Inlet        | disabled          | ns
PSU1 Inlet       | 38 degrees C      | ok
PSU2 Inlet       | 36 degrees C      | ok
PSU              | disabled          | ns
PSU1             | 63 degrees C      | ok
PSU2             | 63 degrees C      | ok
BBU              | 33 degrees C      | ok
RAID Controller  | 73 degrees C      | ok
BATT 3.0V        | 2.44 Volts        | ok
STBY 3.3V        | 3.37 Volts        | ok
iRMC 1.8V STBY   | 1.77 Volts        | ok
iRMC 1.5V STBY   | 1.49 Volts        | ok
iRMC 1.0V STBY   | 0.98 Volts        | ok
MAIN 12V         | 12.78 Volts       | ok
MAIN 5V          | 5.24 Volts        | ok
MAIN 3.3V        | 3.33 Volts        | ok
MEM 1.35V        | 1.35 Volts        | ok
PCH 1.05V        | 1.04 Volts        | ok
MEM VTT 0.68V    | 0.66 Volts        | ok
FAN1 SYS         | 7020 RPM          | ok
FAN2 SYS         | 4080 RPM          | ok
FAN3 SYS         | 4080 RPM          | ok
FAN4 SYS         | 4440 RPM          | ok
FAN5 SYS         | 3840 RPM          | ok
FAN PSU          | disabled          | ns
FAN PSU1         | 2960 RPM          | ok
FAN PSU2         | 2560 RPM          | ok
PSU Power        | disabled          | ns
PSU1 Power       | 54 Watts          | ok
PSU2 Power       | 54 Watts          | ok
Total Power      | 108 Watts         | ok
Total Power Out  | 84 Watts          | ok
I2C1 error ratio | 0 percent         | ok
I2C2 error ratio | 0 percent         | ok
I2C3 error ratio | 0 percent         | ok
I2C4 error ratio | 0 percent         | ok
I2C5 error ratio | 0 percent         | ok
I2C6 error ratio | 0 percent         | ok
I2C7 error ratio | 0 percent         | ok
I2C8 error ratio | 0 percent         | ok
SEL Level        | 0 percent         | ok
Ambient          | 0x00              | ok
Ambient          | 0x00              | ok
CPU              | 0x00              | ok
Power Limit      | 0x00              | ok
Power Unit       | 0x00              | ok
PSU Config       | 0x00              | ok
PSU              | Not Readable      | ns
PSU              | Not Readable      | ns
PSU1             | 0x00              | ok
PSU2             | 0x00              | ok
Power Level      | 0x00              | ok
P-STATE Throttle | 0x00              | ok
System State     | 0x00              | ok
FAN1 SYS         | 0x00              | ok
FAN2 SYS         | 0x00              | ok
FAN3 SYS         | 0x00              | ok
FAN4 SYS         | 0x00              | ok
FAN5 SYS         | 0x00              | ok
FAN PSU          | Not Readable      | ns
FAN PSU1         | 0x00              | ok
FAN PSU2         | 0x00              | ok
Watchdog         | 0x00              | ok
Housing open     | 0x00              | ok
CPU detection    | 0x00              | ok
ME               | 0x00              | ok
iRMC request     | 0x00              | ok
I2C1             | 0x00              | ok
I2C2             | 0x00              | ok
I2C3             | 0x00              | ok
I2C4             | 0x00              | ok
I2C5             | 0x00              | ok
I2C6             | 0x00              | ok
I2C7             | 0x00              | ok
I2C8             | 0x00              | ok
Config backup    | 0x00              | ok
FAN5 SYS         | 0x00              | ok
Power Unit       | 0x00              | ok
PSU              | 0x00              | ok
PSU1             | 0x00              | ok
PSU2             | 0x00              | ok
Total Power Out  | 0x00              | ok
Power Level      | 0x00              | ok
System Mgmt SW   | Not Readable      | ns
Local Monitor    | 0x00              | ok
Pwr Btn override | 0x00              | ok
NMI              | 0x00              | ok
System BIOS      | Not Readable      | ns
iRMC             | 0x00              | ok

You can narrow that call down to show just temperatures, power consumption and whatnot. That's beyond the scope of this tutorial, though. The ipmitool man page is great for seeing all the neat stuff you can do.
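As one example of narrowing the output, a monitoring script could flag any sensor whose status is something other than ok or ns. As the disabled sensors above show, ns appears for sensors with no reading. This is a minimal sketch with a hypothetical function name. It assumes the three-column "name | reading | status" layout shown above and is not how the Anvil! monitor is implemented. ipmitool can also filter by sensor type directly, for example with ipmitool sdr type Temperature.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print sensors from `ipmitool sdr list` that need attention.
# Assumes the "name | reading | status" layout shown above.
# "ok" is healthy and "ns" marks sensors with no reading, so both are skipped.
sdr_problems() {
    awk -F'|' '{
        status = $3
        gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim the padding around the status
        if (status != "" && status != "ok" && status != "ns") print
    }'
}
```

For the output above, ipmitool sdr list | sdr_problems prints nothing, because every sensor is either ok or ns.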

Configuring IPMI LAN Access

So far, we've been talking to the BMC directly, through the operating system on the host. To be useful in an Anvil! cluster (or in many other use cases), we need to be able to talk to the BMC over the network.

Before we can configure it though, we need to find our "LAN channel". Different manufacturers will use different channels, so we need to be able to find the one we're using.

To find it, simply call ipmitool lan print X. Increment X, starting at 1, until you get a response.

So first, let's query LAN channel 1.

an-a01dr01
ipmitool lan print 1
Get Channel Info command failed: Destination unavailable
Invalid channel: 1

No luck; let's try channel 2.

an-a01dr01
ipmitool lan print 2
Set in Progress         : Set Complete
Auth Type Support       : NONE MD2 MD5 PASSWORD OEM 
Auth Type Enable        : Callback : MD5 PASSWORD OEM 
                        : User     : MD5 PASSWORD OEM 
                        : Operator : MD5 PASSWORD OEM 
                        : Admin    : MD5 PASSWORD OEM 
                        : OEM      : MD5 PASSWORD OEM 
IP Address Source       : BIOS Assigned Address
IP Address              : 10.201.23.3
Subnet Mask             : 255.255.0.0
MAC Address             : 90:1b:0e:50:59:8a
SNMP Community String   : public
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl   : 0.0 seconds
Default Gateway IP      : 10.20.255.254
Default Gateway MAC     : 00:00:00:00:00:00
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 0,1,2,3,6,7,8,11,12,15,16,17
Cipher Suite Priv Max   : XaaaaaaaaXXaXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM
Bad Password Threshold  : 0
Invalid password disable: no
Attempt Count Reset Int.: 0
User Lockout Interval   : 0

Found it! So we know that this server uses LAN channel 2. We'll need to use this for the next steps.
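The same search can be scripted. Here is a minimal sketch with a hypothetical function name. It assumes that a valid LAN channel prints a MAC Address line, as channel 2 does above. It checks channels 1 through 11 because the IPMI specification leaves channel numbers 1h to Bh implementation-specific.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first IPMI LAN channel that answers.
# Assumes a working LAN channel prints a "MAC Address" line, as channel 2 does above.
find_lan_channel() {
    local ch
    for ch in 1 2 3 4 5 6 7 8 9 10 11; do
        # Invalid channels print errors (discarded here) and no MAC Address line.
        if ipmitool lan print "$ch" 2>/dev/null | grep -q '^MAC Address'; then
            echo "$ch"
            return 0
        fi
    done
    return 1
}
```

On an-a01dr01, channel=$(find_lan_channel) && echo "LAN channel: $channel" would report channel 2.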

Setting the IPMI Network Info

Now that we can read our IPMI data, it's time to set some values.

Note: If you're not familiar with how IPs are assigned in an Anvil! cluster, check out "Anvil! Networking".

We want to set an-a01dr01's IPMI interface to the IP 10.201.11.3/16. We also need to set up a user on the IPMI BMC so that we can log in from other Anvil! cluster machines.

First up, let's set the IP address. Remember to use the LAN channel you found on your server. We don't actually have a gateway on the 10.201.0.0/16 Back-Channel Network, but some devices insist on a default gateway being set. For this reason, we'll always set 10.201.255.254 as the default gateway.

This requires four calls:

  1. Tell the interface to use a static IP address.
  2. Set the IP address.
  3. Set the subnet mask.
  4. (optional) Set the default gateway.

an-a01dr01
ipmitool lan set 2 ipsrc static
ipmitool lan set 2 ipaddr 10.201.11.3
Setting LAN IP Address to 10.201.11.3
ipmitool lan set 2 netmask 255.255.0.0
Setting LAN Subnet Mask to 255.255.0.0
ipmitool lan set 2 defgw ipaddr 10.201.255.254
Setting LAN Default Gateway IP to 10.201.255.254
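If you configure several machines, you can wrap the four calls in one function. This is a hypothetical helper: set_bmc_ip is not part of ipmitool or the Anvil! tools. It assumes the channel, address, netmask and optional gateway are passed in that order, and it stops at the first command that fails.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the four `ipmitool lan set` calls shown above.
# Usage: set_bmc_ip <channel> <ip> <netmask> [gateway]
set_bmc_ip() {
    local ch=$1 ip=$2 mask=$3 gw=${4:-}

    ipmitool lan set "$ch" ipsrc static    || return 1   # 1. static addressing
    ipmitool lan set "$ch" ipaddr "$ip"    || return 1   # 2. IP address
    ipmitool lan set "$ch" netmask "$mask" || return 1   # 3. subnet mask
    if [ -n "$gw" ]; then                                # 4. optional gateway
        ipmitool lan set "$ch" defgw ipaddr "$gw" || return 1
    fi
}
```

On an-a01dr01 this would be called as set_bmc_ip 2 10.201.11.3 255.255.0.0 10.201.255.254.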

Now we'll again print the LAN channel information and we should see that the IP address has been set.

an-a01dr01
ipmitool lan print 2
Set in Progress         : Set In Progress
Auth Type Support       : NONE MD2 MD5 PASSWORD OEM 
Auth Type Enable        : Callback : MD5 PASSWORD OEM 
                        : User     : MD5 PASSWORD OEM 
                        : Operator : MD5 PASSWORD OEM 
                        : Admin    : MD5 PASSWORD OEM 
                        : OEM      : MD5 PASSWORD OEM 
IP Address Source       : Static Address
IP Address              : 10.201.11.3
Subnet Mask             : 255.255.0.0
MAC Address             : 90:1b:0e:50:59:8a
SNMP Community String   : public
IP Header               : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control         : ARP Responses Enabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl   : 0.0 seconds
Default Gateway IP      : 10.201.255.254
Default Gateway MAC     : 00:00:00:00:00:00
Backup Gateway IP       : 0.0.0.0
Backup Gateway MAC      : 00:00:00:00:00:00
802.1q VLAN ID          : Disabled
802.1q VLAN Priority    : 0
RMCP+ Cipher Suites     : 0,1,2,3,6,7,8,11,12,15,16,17
Cipher Suite Priv Max   : XaaaaaaaaXXaXXX
                        :     X=Cipher Suite Unused
                        :     c=CALLBACK
                        :     u=USER
                        :     o=OPERATOR
                        :     a=ADMIN
                        :     O=OEM
Bad Password Threshold  : 0
Invalid password disable: no
Attempt Count Reset Int.: 0
User Lockout Interval   : 0

Note that "Set in Progress: Set In Progress" is shown. You will want to wait until this displays "Set in Progress: Set Complete".
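Rather than re-running the command by hand, you can poll for the change. Here is a minimal sketch that assumes the exact "Set Complete" wording shown above. The function name, the one-second poll and the default of 30 tries are arbitrary choices. If the setting never completes, the cold reset described below may help.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until the BMC reports it has finished applying LAN settings.
# Usage: wait_set_complete <channel> [tries]
wait_set_complete() {
    local ch=$1 tries=${2:-30} i
    for (( i = 0; i < tries; i++ )); do
        if ipmitool lan print "$ch" 2>/dev/null \
            | grep -q '^Set in Progress *: Set Complete'; then
            return 0
        fi
        sleep 1
    done
    return 1    # still "Set In Progress" after all tries
}
```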

Let's test that we can ping the IP. We'll do this from an-a01n01.

an-a01n01
ping -c 1 10.201.11.3
PING 10.201.11.3 (10.201.11.3) 56(84) bytes of data.
64 bytes from 10.201.11.3: icmp_seq=1 ttl=64 time=0.593 ms

--- 10.201.11.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.593/0.593/0.593/0.000 ms

Excellent!

If you don't get a response, see the section below.

Cold Resetting the BMC

Some IPMI BMCs won't update their IPs or respond to pings until they've been "cold" reset. To do this, run:

an-a01dr01
ipmitool mc reset cold
Sent cold reset command to MC

This will trigger a cold reboot of the BMC. During the reboot, no calls to the IPMI BMC will work. Likewise, nothing will be able to ping the BMC. In most cases, you will hear the fans on the machine spool up close to the end of the reboot process.

Warning: Be patient! It's normal for the BMC cold reset to take a few minutes to complete.
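To avoid guessing when the reset has finished, you can poll the BMC from a peer until it answers pings again. Here is a minimal sketch. The function name, the 5-second interval and the default 300-second timeout are assumptions, not values from any BMC vendor.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait for a BMC to answer pings, e.g. after `ipmitool mc reset cold`.
# Usage: wait_for_bmc <ip> [timeout_seconds]
wait_for_bmc() {
    local ip=$1 timeout=${2:-300} start=$SECONDS
    until ping -c 1 -W 1 "$ip" >/dev/null 2>&1; do
        if (( SECONDS - start >= timeout )); then
            return 1   # gave up; the BMC did not come back in time
        fi
        sleep 5
    done
}
```

For example, on an-a01n01: wait_for_bmc 10.201.11.3 && echo "BMC is back".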

Find the Administrator IPMI User ID

Next up is to find the IPMI administrative user name and user ID.

To see the list of users, run the following.

Note: It can get a bit confusing, as we'll be using digits to reference two different things. We found above that the LAN channel is 2. Shortly, we'll find the digit referencing the administrative user. It's important to keep them sorted in your head.

We'll use LAN channel 2 to request the list of users on the IPMI BMC.

an-a01dr01
ipmitool user list 2
ID  Name	     Callin  Link Auth	IPMI Msg   Channel Priv Limit
1                    false   false      true       Unknown (0x00)
2   admin            false   false      true       OEM
3                    true    false      false      NO ACCESS
4                    true    false      false      NO ACCESS
5                    true    false      false      NO ACCESS
6                    true    false      false      NO ACCESS
7                    true    false      false      NO ACCESS
8                    true    false      false      NO ACCESS
9                    true    false      false      NO ACCESS
10                   true    false      false      NO ACCESS
11                   true    false      false      NO ACCESS
12                   true    false      false      NO ACCESS
13                   true    false      false      NO ACCESS
14                   true    false      false      NO ACCESS
15                   true    false      false      NO ACCESS
16                   true    false      false      NO ACCESS

Note: If you see an error like "Get User Access command failed (channel 2, user 3): Unknown (0x32)", it is safe to ignore.

Normally you should see OEM or ADMINISTRATOR under the Channel Priv Limit column. Above we see that the user named admin with ID 2 is OEM, so that is the user we will use.
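Picking the user out of this table can also be scripted. Here is a minimal sketch with a hypothetical function name. It assumes the column layout above, where the privilege is the last field and unnamed slots have no Name column. For that reason it only reports named users.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "ID name" for each named user whose Channel Priv Limit
# is ADMINISTRATOR or OEM, reading `ipmitool user list <channel>` output on stdin.
admin_users() {
    # Skip the header; named rows have at least 6 fields (unnamed rows have 5).
    awk 'NR > 1 && NF >= 6 && ($NF == "ADMINISTRATOR" || $NF == "OEM") { print $1, $2 }'
}
```

For the table above, ipmitool user list 2 | admin_users prints 2 admin.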

Warning: Many IPMI BMCs will not accept passwords longer than 16 or 20 characters. We're going to try to set the password "super secret password", as used in the main Anvil! cluster tutorial. However, we'll see that this causes problems, and we'll show how to deal with that.

To set the password to "super secret password", run the following command and then enter super secret password twice when prompted.

Note: The 2 in the next command is the user ID, not the LAN channel!

an-a01dr01
ipmitool user set password 2
Password for user 2: 
Password for user 2:
Password is too long (> 20 bytes)

Above, we see that the password was too long. So this time, we'll try "super secret passwo", which is 19 characters long.

an-a01dr01
ipmitool user set password 2
Password for user 2: 
Password for user 2:
IPMI command failed: Invalid data field in request
Set User Password command failed (user 2)

And it still fails! Well, this is because some IPMI BMCs do not like space characters in their password. So this time, let's try "supersecretpassword", which is 19 characters long and has no spaces.

an-a01dr01
ipmitool user set password 2
Password for user 2: 
Password for user 2:
Set User Password command successful (user 2)

There we go! (Or did we...?)

We show these failed password attempts to explain why the Anvil! cluster may alter the password you chose for the cluster when it sets the password on a host's IPMI BMC. We have to work within the restrictions of the BMC itself.

Testing the IPMI Connection From the Peer

At this point, we've set each node's IPMI BMC network address and admin user's password. Now it's time to make sure it works.

In the example above, we walked through setting up an-a01dr01's IPMI BMC. So here, we will log into an-a01n01 and try to connect to 10.201.11.3 (an-a01dr01's IP we set earlier) to make sure everything works.

an-a01n01
ipmitool -I lanplus -U admin -P supersecretpassword -H 10.201.11.3 chassis power status
Error: Unable to establish IPMI v2 / RMCP+ session

Wait, what?!

We "successfully" set the password, but the BMC silently truncated it to 16 characters. It rejected the password that was over 20 characters, but quietly cut the 19-character one down to 16. Let's confirm this by shortening the password to supersecretpassw.

an-a01n01
ipmitool -I lanplus -U admin -P supersecretpassw -H 10.201.11.3 chassis power status
Chassis Power is on

Excellent! Now wasn't that a pain in the arse?

Note: This is why the Anvil! software always shortens the password given for a cluster to 16 characters and removes spaces when configuring the IPMI BMC. So if you can't access your BMC after integrating it into an Anvil!, try shortening the password and removing the spaces; it will likely work.
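The workaround above can be sketched in shell. The sketch follows the order used in the walkthrough: remove spaces, then keep the first 16 characters. The Anvil! software's exact steps are not shown here, so treat this as an illustration, not its implementation. The function names are hypothetical. The login check passes the password through the IPMI_PASSWORD environment variable with ipmitool's -E option, which keeps it off the command line. If the password as given is rejected, it falls back to the shortened form.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the password workaround described above.

# Remove spaces, then keep at most the first 16 characters.
bmc_safe_password() {
    local pw=${1// /}
    printf '%s\n' "${pw:0:16}"
}

# Try the password as given, then the shortened form; print whichever works.
# Assumes ipmitool and a BMC reachable with the lanplus interface.
bmc_find_password() {
    local host=$1 user=$2 candidate
    for candidate in "$3" "$(bmc_safe_password "$3")"; do
        if IPMI_PASSWORD=$candidate ipmitool -I lanplus -H "$host" -U "$user" -E \
            chassis power status >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

From an-a01n01, bmc_find_password 10.201.11.3 admin 'super secret password' should print supersecretpassw for the BMC configured above.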

Woohoo!

 

Any questions, feedback, advice, complaints or meanderings are welcome.
Us: Alteeve's Niche! Support: Mailing List IRC: #clusterlabs on Libera Chat
© Alteeve's Niche! Inc. 1997-2023   Anvil! "Intelligent Availability™" Platform
legal stuff: All info is provided "As-Is". Do not use anything here unless you are willing and able to take responsibility for your own actions.