OPENSM(8)                      OpenIB Management                     OPENSM(8)
NAME
opensm - InfiniBand subnet manager and administration (SM/SA)
SYNOPSIS
opensm [--version] [-F | --config <file_name>] [-c(reate-config)
<file_name>] [-g(uid) <GUID in hex>] [-l(mc) <LMC>] [-p(riority) <PRI‐
ORITY>] [--subnet_prefix <PREFIX in hex>] [-smkey <SM_Key>] [--sm_sl
<SL number>] [-r(eassign_lids)] [-R <engine name(s)> | --routing_engine
<engine name(s)>] [--do_mesh_analysis] [--lash_start_vl <vl number>]
[-A | --ucast_cache] [-z | --connect_roots] [-M <file name> |
--lid_matrix_file <file name>] [-U <file name> | --lfts_file <file
name>] [-S | --sadb_file <file name>] [-a | --root_guid_file <path to
file>] [-u | --cn_guid_file <path to file>] [-G | --io_guid_file <path
to file>] [--port-shifting] [--scatter-ports] [-H | --max_reverse_hops
<max reverse hops allowed>] [-X | --guid_routing_order_file <path to
file>] [-m | --ids_guid_file <path to file>] [-o(nce)] [-s(weep)
<interval>] [-t(imeout) <milliseconds>] [--retries <number>] [-maxsmps
<number>] [-console [off | local | socket | loopback]] [-console-port
<port>] [-i(gnore-guids) <equalize-ignore-guids-file>] [-w |
--hop_weights_file <path to file>] [-O | --port_search_ordering_file
<path to file>] [-O | --dimn_ports_file <path to file>] (DEPRECATED)
[-f <log file path> | --log_file <log file path> ] [-L | --log_limit
<size in MB>] [-e(rase_log_file)] [-P(config) <partition config file> ]
[-N | --no_part_enforce] (DEPRECATED) [-Z | --part_enforce [both | in |
out | off]] [-W | --allow_both_pkeys] [-Q | --qos [-Y | --qos_pol‐
icy_file <file name>]] [--congestion-control] [--cckey <key>] [-y |
--stay_on_fatal] [-B | --daemon] [-I | --inactive] [--perfmgr]
[--perfmgr_sweep_time_s <seconds>] [--prefix_routes_file <path>]
[--consolidate_ipv6_snm_req] [--log_prefix <prefix text>] [--torus_con‐
fig <path to file>] [-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>]
[-h(elp)] [-?]
DESCRIPTION
opensm is an InfiniBand-compliant Subnet Manager and Subnet Administration
(SM/SA) that runs on top of OpenIB.
opensm provides an implementation of an InfiniBand Subnet Manager and
Administration. Such a software entity is required in order to initialize
the InfiniBand hardware (at least one per InfiniBand subnet).
opensm also contains an experimental performance manager.
opensm defaults were designed to meet the common case usage on clusters
with up to a few hundred nodes. Thus, in this default mode, opensm will
scan the IB fabric, initialize it, and sweep occasionally for changes.
opensm attaches to a specific IB port on the local machine and config‐
ures only the fabric connected to it. (If the local machine has other
IB ports, opensm will ignore the fabrics connected to those other
ports). If no port is specified, it will select the first "best" avail‐
able port.
opensm can present the available ports and prompt for a port number to
attach to.
By default, the run is logged to two files: /var/log/messages and
/var/log/opensm.log. The first file will register only general major
events, whereas the second will include details of reported errors. All
errors reported in this second file should be treated as indicators of
IB fabric health issues. (Note that when a fatal and non-recoverable
error occurs, opensm will exit.) Both log files should include the
message "SUBNET UP" if opensm was able to setup the subnet correctly.
OPTIONS
--version
Prints OpenSM version and exits.
-F, --config <config file>
The name of the OpenSM config file. When not specified,
/etc/rdma/opensm.conf will be used (if it exists).
-c, --create-config <file name>
OpenSM will dump its configuration to the specified file and
exit. This is a way to generate an OpenSM configuration file
template.
-g, --guid <GUID in hex>
This option specifies the local port GUID value with which
OpenSM should bind. OpenSM may be bound to 1 port at a time.
If the GUID given is 0, OpenSM displays a list of possible port
GUIDs and waits for user input. Without -g, OpenSM tries to use
the default port.
-l, --lmc <LMC value>
This option specifies the subnet's LMC value. The number of
LIDs assigned to each port is 2^LMC. The LMC value must be in
the range 0-7. LMC values > 0 allow multiple paths between
ports. LMC values > 0 should only be used if the subnet topol‐
ogy actually provides multiple paths between ports, i.e. multi‐
ple interconnects between switches. Without -l, OpenSM defaults
to LMC = 0, which allows one path between any two ports.
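As an illustration of the LMC arithmetic above (a sketch in Python, not OpenSM code), each port owns 2^LMC consecutive LIDs starting at its base LID:

```python
def lid_range(base_lid, lmc):
    """Return the list of LIDs a port owns for a given LMC.

    Each port is assigned 2^LMC consecutive LIDs starting at its
    base LID (which is aligned to a multiple of 2^LMC).
    """
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    count = 1 << lmc  # 2^LMC LIDs per port
    return list(range(base_lid, base_lid + count))

# With LMC = 0 (the default) each port gets exactly one LID:
print(lid_range(0x10, 0))  # [16]
# With LMC = 2 each port gets 4 LIDs, enabling multiple paths:
print(lid_range(0x10, 2))  # [16, 17, 18, 19]
```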
-p, --priority <Priority value>
This option specifies the SM's PRIORITY. This will affect the
handover cases, where the master is chosen by priority and GUID.
Range goes from 0 (default and lowest priority) to 15 (highest).
--subnet_prefix <PREFIX in hex>
This option specifies the subnet prefix to use on the fabric.
The default prefix is 0xfe80000000000000. OpenMPI in particular
requires separate fabrics plugged into different ports to have
different prefixes or else it won't run.
-smkey <SM_Key value>
This option specifies the SM's SM_Key (64 bits). This will
affect SM authentication. Note that OpenSM version 3.2.1 and
below used the default value '1' in host byte order; this has
been fixed, but you may need this option to interoperate with an
old OpenSM running on a little endian machine.
--sm_sl <SL number>
This option sets the SL to use for communication with the SM/SA.
Defaults to 0.
-r, --reassign_lids
This option causes OpenSM to reassign LIDs to all end nodes.
Specifying -r on a running subnet may disrupt subnet traffic.
Without -r, OpenSM attempts to preserve existing LID assignments
while resolving multiple uses of the same LID.
-R, --routing_engine <Routing engine names>
This option chooses routing engine(s) to use instead of Min Hop
algorithm (default). Multiple routing engines can be specified
separated by commas so that specific ordering of routing algo‐
rithms will be tried if earlier routing engines fail. If all
configured routing engines fail, OpenSM will always attempt to
route with Min Hop unless 'no_fallback' is included in the list
of routing engines. Supported engines: minhop, updn, dnup,
file, ftree, lash, dor, torus-2QoS, dfsssp, sssp.
--do_mesh_analysis
This option enables additional analysis for the lash routing
engine to precondition switch port assignments in regular carte‐
sian meshes which may reduce the number of SLs required to give
a deadlock free routing.
--lash_start_vl <vl number>
This option sets the starting VL to use for the lash routing
algorithm. Defaults to 0.
-A, --ucast_cache
This option enables unicast routing cache and prevents routing
recalculation (which is a heavy task in a large cluster) when
there was no topology change detected during the heavy sweep, or
when the topology change does not require a new routing
calculation, e.g. when one or more CAs/RTRs/leaf switches go
down, or one or more of these nodes come back after being down. A
very common case that is handled by the unicast routing cache is
host reboot, which otherwise would cause two full routing recal‐
culations: one when the host goes down, and the other when the
host comes back online.
-z, --connect_roots
This option forces the routing engines (up/down and fat-tree) to
establish connectivity between root switches, and in this way to
be fully IBA compliant. In many cases this can violate the "pure"
deadlock-free algorithm, so use it carefully.
-M, --lid_matrix_file <file name>
This option specifies the name of the lid matrix dump file from
which switch lid matrices (min hops tables) will be loaded.
-U, --lfts_file <file name>
This option specifies the name of the LFTs file from which
switch forwarding tables will be loaded.
-S, --sadb_file <file name>
This option specifies the name of the SA DB dump file from which
the SA database will be loaded.
-a, --root_guid_file <file name>
Set the root nodes for the Up/Down or Fat-Tree routing algorithm
to the guids provided in the given file (one to a line).
-u, --cn_guid_file <file name>
Set the compute nodes for the Fat-Tree routing algorithm to the
guids provided in the given file (one to a line).
-G, --io_guid_file <file name>
Set the I/O nodes for the Fat-Tree routing algorithm to the
guids provided in the given file (one to a line). I/O nodes are
non-CN nodes allowed to use up to max_reverse_hops switches the
wrong way around to improve connectivity.
--port-shifting
This option enables a feature called port shifting. In some
fabrics, particularly cluster environments, routes commonly
align and congest with other routes due to algorithmically
unchanging traffic patterns. This routing option will "shift"
routing around in an attempt to alleviate this problem.
--scatter-ports
This option randomizes port selection during routing.
-H, --max_reverse_hops <max reverse hops allowed>
Set the maximum number of reverse hops an I/O node is allowed to
make. A reverse hop is the use of a switch the wrong way around.
-m, --ids_guid_file <file name>
Name of the map file with set of the IDs which will be used by
Up/Down routing algorithm instead of node GUIDs (format: <guid>
<id> per line).
-X, --guid_routing_order_file <file name>
Set the order port guids will be routed for the MinHop and
Up/Down routing algorithms to the guids provided in the given
file (one to a line).
-o, --once
This option causes OpenSM to configure the subnet once, then
exit. Ports remain in the ACTIVE state.
-s, --sweep <interval value>
This option specifies the number of seconds between subnet
sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM
defaults to a sweep interval of 10 seconds.
-t, --timeout <value>
This option specifies the time in milliseconds used for transac‐
tion timeouts. Timeout values should be > 0. Without -t,
OpenSM defaults to a timeout value of 200 milliseconds.
--retries <number>
This option specifies the number of retries used for transac‐
tions. Without --retries, OpenSM defaults to 3 retries for
transactions.
-maxsmps <number>
This option specifies the number of VL15 SMP MADs allowed on the
wire at any one time. Specifying -maxsmps 0 allows unlimited
outstanding SMPs. Without -maxsmps, OpenSM defaults to a maxi‐
mum of 4 outstanding SMPs.
-console [off | local | loopback | socket]
This option brings up the OpenSM console (default off). Note
that loopback and socket open a socket that can be connected to
WITHOUT CREDENTIALS. Loopback is safer if access to your SM
host is controlled. tcp_wrappers (hosts.[allow|deny]) is used
with loopback and socket. loopback and socket will only be
available if OpenSM was built with --enable-console-loopback
(default yes) and --enable-console-socket (default no) respec‐
tively.
-console-port <port>
Specify an alternate telnet port for the socket console (default
10000). Note that this option only appears if OpenSM was built
with --enable-console-socket.
-i, -ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by node
guid and port number) that will be ignored by the link load
equalization algorithm.
-w, --hop_weights_file <path to file>
This option provides weighting factors per port representing a
hop cost in computing the lid matrix. The file consists of
lines containing a switch port GUID (specified as a 64 bit hex
number, with leading 0x), output port number, and weighting fac‐
tor. Any port not listed in the file defaults to a weighting
factor of 1. Lines starting with # are comments. Weights
affect only the output route from the port, so many useful con‐
figurations will require weights to be specified in pairs.
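The hop weights file format described above can be sketched with a small Python parser (illustrative only, not the OpenSM parser; the sample GUID is hypothetical):

```python
def parse_hop_weights(text):
    """Parse lines of: switch port GUID (hex, leading 0x),
    output port number, weighting factor.  '#' starts a comment."""
    weights = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments
        if not line:
            continue
        guid, port, weight = line.split()
        weights[(int(guid, 16), int(port))] = int(weight)
    return weights

def hop_weight(weights, guid, port):
    # Any port not listed in the file defaults to a weighting factor of 1.
    return weights.get((guid, port), 1)

sample = """
# penalize the route out of port 3 of this switch
0x0008f10400411a08 3 10
"""
w = parse_hop_weights(sample)
print(hop_weight(w, 0x0008f10400411a08, 3))  # 10
print(hop_weight(w, 0x0008f10400411a08, 1))  # 1 (default)
```

Since weights affect only the output route from a port, a symmetric penalty on a link requires a second line for the port at the other end.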
-O, --port_search_ordering_file <path to file>
This option tweaks the routing. It is suitable for two cases:
1. When using the DOR routing algorithm. This option provides a
mapping between hypercube dimensions and ports on a per-switch
basis for the DOR routing engine. The file consists of lines
containing a switch node GUID (specified as a 64 bit hex number,
with leading 0x) followed by a list of non-zero port numbers,
separated by spaces, one switch per line. The order of the port
numbers is in one to one correspondence with the dimensions.
Ports not listed on a line are assigned to the remaining
dimensions, in port order. Anything after a # is a comment.
2. When using a general routing algorithm. This option provides
the order of the ports to be chosen for routing from each
switch, rather than searching for an appropriate port from port
1 to N. The file format is the same as above. In the case of
DOR, the order of the port numbers is in one to one
correspondence with the dimensions; ports not listed on a line
are assigned to the remaining dimensions, in port order.
Anything after a # is a comment.
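The line format above (switch GUID followed by non-zero port numbers) can be sketched as follows (a hedged illustration, not the OpenSM parser; the GUID is hypothetical):

```python
def parse_port_order(text):
    """Parse lines of: switch node GUID (hex, leading 0x) followed by
    non-zero port numbers separated by spaces; '#' starts a comment."""
    order = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        guid = int(fields[0], 16)
        ports = [int(p) for p in fields[1:]]
        if 0 in ports:
            raise ValueError("port numbers must be non-zero")
        order[guid] = ports
    return order

sample = "0x0008f10400411a08 1 3 5 2  # try port 1 first"
print(parse_port_order(sample)[0x0008f10400411a08])  # [1, 3, 5, 2]
```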
-O, --dimn_ports_file <path to file> (DEPRECATED)
This is a deprecated flag. Please use --port_search_ordering_file
instead. This option provides a mapping between hypercube
dimensions and ports on a per switch basis for the DOR routing
engine. The file consists of lines containing a switch node
GUID (specified as a 64 bit hex number, with leading 0x) fol‐
lowed by a list of non-zero port numbers, separated by spaces,
one switch per line. The order for the port numbers is in one
to one correspondence to the dimensions. Ports not listed on a
line are assigned to the remaining dimensions, in port order.
Anything after a # is a comment.
-x, --honor_guid2lid
This option forces OpenSM to honor the guid2lid file, when it
comes out of Standby state, if such file exists under
OSM_CACHE_DIR, and is valid. By default, this is FALSE.
-f, --log_file <file name>
This option defines the log to be the given file. By default,
the log goes to /var/log/opensm.log. For the log to go to stan‐
dard output use -f stdout.
-L, --log_limit <size in MB>
This option defines the maximal log file size in MB. When
specified, the log file will be truncated upon reaching this limit.
-e, --erase_log_file
This option will cause deletion of the log file (if it already
exists). By default, the log file is accumulative.
-P, --Pconfig <partition config file>
This option defines the optional partition configuration file.
The default name is /etc/rdma/partitions.conf.
--prefix_routes_file <file name>
Prefix routes control how the SA responds to path record queries
for off-subnet DGIDs. By default, the SA fails such queries.
The PREFIX ROUTES section below describes the format of the con‐
figuration file. The default path is
/etc/rdma/prefix-routes.conf.
-Q, --qos
This option enables QoS setup. It is disabled by default.
-Y, --qos_policy_file <file name>
This option defines the optional QoS policy file. The default
name is /etc/rdma/qos-policy.conf. See QoS_manage‐
ment_in_OpenSM.txt in opensm doc for more information on config‐
uring QoS policy via this file.
--congestion_control
(EXPERIMENTAL) This option enables congestion control configura‐
tion. It is disabled by default. See the config file for conges‐
tion control configuration options.
--cc_key <key>
(EXPERIMENTAL) This option configures the CCkey to use when
configuring congestion control. Note that this option does not
configure a new CCkey into switches and CAs. Defaults to 0.
-N, --no_part_enforce (DEPRECATED)
This is a deprecated flag. Please use --part_enforce instead.
This option disables partition enforcement on switch external
ports.
-Z, --part_enforce [both | in | out | off]
This option indicates the partition enforcement type (for
switches). Enforcement type can be inbound only (in), outbound
only (out), both or disabled (off). Default is both.
-W, --allow_both_pkeys
This option indicates whether both full and limited membership
on the same partition can be configured in the PKeyTable.
Default is not to allow both pkeys.
-y, --stay_on_fatal
This option will cause SM not to exit on fatal initialization
issues: if SM discovers duplicated guids or a 12x link with lane
reversal badly configured. By default, the SM will exit on
these errors.
-B, --daemon
Run in daemon mode - OpenSM will run in the background.
-I, --inactive
Start SM in inactive rather than init SM state. This option can
be used in conjunction with the perfmgr so as to run a stand‐
alone performance manager without SM/SA. However, this is NOT
currently implemented in the performance manager.
-perfmgr
Enable the perfmgr. Only takes effect if --enable-perfmgr was
specified at configure time. See performance-manager-HOWTO.txt
in opensm doc for more information on running perfmgr.
-perfmgr_sweep_time_s <seconds>
Specify the sweep time for the performance manager in seconds
(default is 180 seconds). Only takes effect if --enable-perfmgr
was specified at configure time.
--consolidate_ipv6_snm_req
Use shared MLID for IPv6 Solicited Node Multicast groups per
MGID scope and P_Key.
--log_prefix <prefix text>
This option specifies the prefix to the syslog messages from
OpenSM. A suitable prefix can be used to identify the IB subnet
in syslog messages when two or more instances of OpenSM run in a
single node to manage multiple fabrics. For example, in a dual-
fabric (or dual-rail) IB cluster, the prefix for the first fab‐
ric could be "mpi" and the other fabric could be "storage".
--torus_config <path to torus-2QoS config file>
This option defines the file name for the extra configuration
information needed for the torus-2QoS routing engine. The
default name is /etc/rdma/torus-2QoS.conf
-v, --verbose
This option increases the log verbosity level. The -v option
may be specified multiple times to further increase the ver‐
bosity level. See the -D option for more information about log
verbosity.
-V This option sets the maximum verbosity level and forces log
flushing. The -V option is equivalent to '-D 0xFF -d 2'. See
the -D option for more information about log verbosity.
-D <value>
This option sets the log verbosity level. A flags field must
follow the -D option. A bit set/clear in the flags enables/dis‐
ables a specific log level as follows:
BIT LOG LEVEL ENABLED
---------------------
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x04 - VERBOSE (interesting stuff, moderate volume)
0x08 - DEBUG (diagnostic, high volume)
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM log‐
ging)
Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying
-D 0 disables all messages. Specifying -D 0xFF enables all mes‐
sages (see -V). High verbosity levels may require increasing
the transaction timeout with the -t option.
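The flag bits in the table above combine by bitwise OR; a small sketch (not OpenSM code) shows how a -D value decodes into enabled levels:

```python
LOG_LEVELS = {
    0x01: "ERROR", 0x02: "INFO", 0x04: "VERBOSE", 0x08: "DEBUG",
    0x10: "FUNCS", 0x20: "FRAMES", 0x40: "ROUTING", 0x80: "SYS",
}

def decode_log_flags(flags):
    """Return the names of the log levels enabled by a -D flags value."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if flags & bit]

print(decode_log_flags(0x03))  # ['ERROR', 'INFO'] -- the default
print(decode_log_flags(0xFF))  # all eight levels (as used by -V)
```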
-d, --debug <value>
This option specifies a debug option. These options are not
normally needed. The number following -d selects the debug
option to enable as follows:
OPT Description
--------------------
-d0 - Ignore other SM nodes
-d1 - Force single threaded dispatching
-d2 - Force log flushing after each log message
-d3 - Disable multicast support
-h, --help
Display this usage info then exit.
-? Display this usage info then exit.
ENVIRONMENT VARIABLES
The following environment variables control opensm behavior:
OSM_TMP_DIR - controls the directory in which the temporary files gen‐
erated by opensm are created. These files are: opensm-subnet.lst,
opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
OSM_CACHE_DIR - opensm stores certain data to the disk such that subse‐
quent runs are consistent. The default directory used is
/var/cache/opensm. The following files are included in it:
guid2lid - stores the LID range assigned to each GUID
guid2mkey - stores the MKey previously assigned to each GUID
neighbors - stores a map of the GUIDs at either end of each link
in the fabric
NOTES
When opensm receives a HUP signal, it starts a new heavy sweep as if a
trap was received or a topology change was found.
Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log
for logrotate purposes.
PARTITION CONFIGURATION
The default name of OpenSM partitions configuration file is
/etc/rdma/partitions.conf. The default may be changed by using the
--Pconfig (-P) option with OpenSM.
The default partition will be created by OpenSM unconditionally, even
when the partition configuration file does not exist or cannot be
accessed. The default partition has P_Key value 0x7fff. OpenSM's port
will always have full membership in the default partition. All other
end ports will have full membership if the partition configuration
file is not found or cannot be accessed, or limited membership if the
file exists and can be accessed but there is no rule for the Default
partition.
Effectively, this amounts to the same as if one of the following rules
appeared in the partition configuration file.
In the case of no rule for the Default partition:
Default=0x7fff : ALL=limited, SELF=full ;
In the case of no partition configuration file or file cannot be
accessed:
Default=0x7fff : ALL=full ;
File Format
Comments:
Line content following a '#' character is a comment and is ignored by
the parser.
General file format:
<Partition Definition>:[<newline>]<Partition Properties>;
Partition Definition:
[PartitionName][=PKey][,ipoib_bc_flags][,defmember=full|limited]
PartitionName  - string, will be used with logging. When omitted, an
                 empty string will be used.
PKey           - P_Key value for this partition. Only the low 15 bits
                 will be used. When omitted, it will be autogenerated.
ipoib_bc_flags - used to indicate/specify IPoIB capability of this
                 partition.
defmember=full|limited|both - specifies default membership for the
                 port guid list. Default is limited.
ipoib_bc_flags:
ipoib_flag|[mgroup_flag]*
ipoib_flag - indicates that this partition may be used for IPoIB; as
             a result, the IPoIB broadcast group will be created with
             the flags given, if any.
Partition Properties:
[<Port list>|<MCast Group>]* | <Port list>
Port list:
<Port Specifier>[,<Port Specifier>]
Port Specifier:
<PortGUID>[=[full|limited|both]]
PortGUID       - GUID of partition member EndPort. Hexadecimal
                 numbers should start with 0x; decimal numbers are
                 accepted too.
full, limited, - indicates full and/or limited membership for this
both             port. When omitted (or unrecognized), limited
                 membership is assumed. "both" indicates both full
                 and limited membership for this port.
MCast Group:
mgid=gid[,mgroup_flag]*<newline>
- the gid specified is verified to be a multicast address. IP
  groups are verified to match the rate and mtu of the broadcast
  group. The P_Key bits of the mgid for IP groups are verified to
  either match the P_Key specified in the "Partition Definition"
  or, if they are 0x0000, the P_Key will be copied into those bits.
mgroup_flag:
rate=<val>      - specifies rate for this MC group
                  (default is 3 (10 Gbps))
mtu=<val>       - specifies MTU for this MC group
                  (default is 4 (2048))
sl=<val>        - specifies SL for this MC group
                  (default is 0)
scope=<val>     - specifies scope for this MC group
                  (default is 2 (link local)). Multiple scope
                  settings are permitted for a partition.
                  NOTE: This overwrites the scope nibble of the
                  specified mgid. Furthermore, specifying multiple
                  scope settings will result in multiple MC groups
                  being created.
qkey=<val>      - specifies the Q_Key for this MC group
                  (default: 0x0b1b for IP groups, 0 for other groups)
tclass=<val>    - specifies tclass for this MC group
                  (default is 0)
FlowLabel=<val> - specifies FlowLabel for this MC group
                  (default is 0)
newline: '\n'
Note that values for rate, mtu, and scope, for both partitions and mul‐
ticast groups, should be specified as defined in the IBTA specification
(for example, mtu=4 for 2048).
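A partial lookup of those IBTA encodings can be sketched as follows (hedged: only the common values are shown here; consult the IBTA specification for the full tables):

```python
# Partial IBTA encodings.  The defaults quoted above are
# mtu=4 (2048 bytes) and rate=3 (10 Gbps).
IBTA_MTU = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}  # bytes
IBTA_RATE = {2: 2.5, 3: 10, 4: 30, 5: 5, 6: 20, 7: 40}  # Gbps

print(IBTA_MTU[4])   # 2048 -- default MC group MTU in bytes
print(IBTA_RATE[3])  # 10   -- default MC group rate in Gbps
```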
There are several useful keywords for PortGUID definition:
- 'ALL' means all end ports in this subnet.
- 'ALL_CAS' means all Channel Adapter end ports in this subnet.
- 'ALL_SWITCHES' means all Switch end ports in this subnet.
- 'ALL_ROUTERS' means all Router end ports in this subnet.
- 'SELF' means subnet manager's port.
Empty list means no ports in this partition.
Notes:
White space is permitted between delimiters ('=', ',',':',';').
PartitionName does not need to be unique; PKey does need to be unique.
If a PKey is repeated, those partition configurations will be merged
and the first PartitionName will be used (see also the next note).
It is possible to split partition configuration in more than one defi‐
nition, but then PKey should be explicitly specified (otherwise differ‐
ent PKey values will be generated for those definitions).
Examples:
Default=0x7fff : ALL, SELF=full ;
Default=0x7fff : ALL, ALL_SWITCHES=full, SELF=full ;
NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306
;
YetAnotherOne = 0x300 : SELF=full ;
YetAnotherOne = 0x300 : ALL=limited ;
ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
# 0x123453, 0x123454 will be limited
ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
# 0x123456, 0x123457 will be limited
ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457,
0x123458=full;
ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited,
0x12345d;
# multicast groups added to default
Default=0x7fff,ipoib:
mgid=ff12:401b::0707,sl=1 # random IPv4 group
mgid=ff12:601b::16 # MLDv2-capable routers
mgid=ff12:401b::16 # IGMP
mgid=ff12:601b::2 # All routers
mgid=ff12::1,sl=1,Q_Key=0xDEADBEEF,rate=3,mtu=2 # random group
ALL=full;
Note:
The following rule is equivalent to how OpenSM used to run prior to the
partition manager:
Default=0x7fff,ipoib:ALL=full;
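As a minimal sketch of the grammar above (handling only a simple single-line subset, not the actual OpenSM parser), a partition definition can be split into name, PKey, flags, and member list:

```python
def parse_partition(line):
    """Parse '<Name>[=<PKey>][,flags...] : <port list> ;'
    Omitted membership defaults to limited, as described above."""
    head, _, body = line.partition(":")
    body = body.strip().rstrip(";").strip()
    fields = [f.strip() for f in head.split(",")]
    name, _, pkey = fields[0].partition("=")
    members = {}
    for spec in body.split(","):
        guid, _, membership = spec.strip().partition("=")
        members[guid] = membership or "limited"  # omitted -> limited
    return {"name": name.strip(), "pkey": pkey.strip(),
            "flags": fields[1:], "members": members}

p = parse_partition("Default=0x7fff,ipoib : ALL=limited, SELF=full ;")
print(p["pkey"])     # 0x7fff
print(p["members"])  # {'ALL': 'limited', 'SELF': 'full'}
```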
QOS CONFIGURATION
There is a set of QoS-related low-level configuration parameters. All
these parameter names are prefixed by the "qos_" string. Here is a
full list of these parameters:
qos_max_vls - The maximum number of VLs that will be on the subnet
qos_high_limit - The limit of High Priority component of VL
Arbitration table (IBA 7.6.9)
qos_vlarb_low - Low priority VL Arbitration table (IBA 7.6.9)
template
qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
template
Both VL arbitration templates are pairs of
VL and weight
qos_sl2vl - SL2VL Mapping table (IBA 7.6.6) template. It is
a list of VLs corresponding to SLs 0-15 (Note
that VL15 used here means drop this SL)
Typical default values (hard-coded in OpenSM initialization) are:
qos_max_vls 15
qos_high_limit 0
qos_vlarb_low
0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
qos_vlarb_high
0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
qos_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
The syntax is compatible with the rest of the OpenSM configuration
options, and values may be stored in the OpenSM config file (cached
options file).
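The template formats shown above (VL:weight pairs for the arbitration tables, and a VL list for sl2vl) can be sketched with small Python parsers (illustrative only):

```python
def parse_vlarb(template):
    """Parse a VL arbitration template: comma-separated VL:weight pairs."""
    return [tuple(int(x) for x in pair.split(":"))
            for pair in template.split(",")]

def parse_sl2vl(template):
    """Parse the sl2vl template: the VLs for SLs 0-15.
    VL 15 means 'drop this SL'."""
    return [int(v) for v in template.split(",")]

high = parse_vlarb("0:4,1:0,2:0")
print(high[0])  # (0, 4) -- VL 0 gets weight 4
sl2vl = parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7")
print(sl2vl[15])  # 7 -- SL 15 maps to VL 7
```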
In addition to the above, we may define separate QoS configuration
parameter sets for various target types. As targets, we currently
support CAs, routers, switch external ports, and the switch's enhanced
port 0. The names of such specialized parameters are prefixed by the
"qos_<type>_" string. Here is a full list of the currently supported
sets:
qos_ca_ - QoS configuration parameters set for CAs.
qos_rtr_ - parameters set for routers.
qos_sw0_ - parameters set for switches' port 0.
qos_swe_ - parameters set for switches' external ports.
Examples:
qos_sw0_max_vls=2
qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
qos_swe_high_limit=0
PREFIX ROUTES
Prefix routes control how the SA responds to path record queries for
off-subnet DGIDs. By default, the SA fails such queries. Note that
IBA does not specify how the SA should obtain off-subnet path record
information. The prefix routes configuration is meant as a stop-gap
until the specification is completed.
Each line in the configuration file is a 64-bit prefix followed by a
64-bit GUID, separated by white space. The GUID specifies the router
port on the local subnet that will handle the prefix. Blank lines are
ignored, as is anything between a # character and the end of the line.
The prefix and GUID are both in hex; the leading 0x is optional.
Either, or both, can be wild-carded by specifying an asterisk instead
of an explicit prefix or GUID.
When responding to a path record query for an off-subnet DGID, opensm
searches for the first prefix match in the configuration file. There‐
fore, the order of the lines in the configuration file is important: a
wild-carded prefix at the beginning of the configuration file renders
all subsequent lines useless. If there is no match, then opensm fails
the query. It is legal to repeat prefixes in the configuration file,
opensm will return the path to the first available matching router. A
configuration file with a single line where both prefix and GUID are
wild-carded means that a path record query specifying any off-subnet
DGID should return a path to the first available router. This configu‐
ration yields the same behavior formerly achieved by compiling opensm
with -DROUTER_EXP which has been obsoleted.
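The first-match lookup described above can be sketched in Python (illustrative only; the GUIDs and prefixes in the sample are hypothetical):

```python
def parse_prefix_routes(text):
    """Parse prefix-routes lines: '<prefix> <guid>' in hex, '*' as a
    wildcard, '#' starting a comment."""
    routes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        prefix, guid = line.split()
        routes.append((None if prefix == "*" else int(prefix, 16), guid))
    return routes

def lookup_router(routes, dgid_prefix):
    # First prefix match wins, so line order in the file matters.
    for prefix, guid in routes:
        if prefix is None or prefix == dgid_prefix:
            return guid
    return None  # no match: the SA fails the query

routes = parse_prefix_routes("""
fe80000000000001 0x0002c90300304dd0  # one subnet's router
* 0x0002c90300304dd4                 # everything else
""")
print(lookup_router(routes, 0xfe80000000000001))  # 0x0002c90300304dd0
print(lookup_router(routes, 0xfe80000000000002))  # 0x0002c90300304dd4
```

Note that placing the wild-carded line first would shadow every later entry, matching the warning above about line order.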
MKEY CONFIGURATION
OpenSM supports configuring a single management key (MKey) for use
across the subnet.
The following configuration options are available:
m_key - the 64-bit MKey to be used on the subnet
(IBA 14.2.4)
m_key_protection_level - the numeric value of the MKey ProtectBits
(IBA 14.2.4.1)
m_key_lease_period - the number of seconds a CA will wait for a
response from the SM before resetting the
protection level to 0 (IBA 14.2.4.2).
OpenSM will configure all ports with the MKey specified by m_key,
defaulting to a value of 0. An m_key value of 0 disables MKey protection
on the subnet. Switches and HCAs with a non-zero MKey will not accept
requests to change their configuration unless the request includes the
proper MKey.
MKey Protection Levels
MKey protection levels modify how switches and CAs respond to SMPs
lacking a valid MKey. OpenSM will configure each port's ProtectBits to
support the level defined by the m_key_protection_level parameter. If
no parameter is specified, OpenSM defaults to operating at protection
level 0.
There are currently 4 protection levels defined by the IBA:
0 - Queries return valid data, including MKey. Configuration changes
are not allowed unless the request contains a valid MKey.
1 - Like level 0, but the MKey is set to 0 (0x00000000) in queries,
unless the request contains a valid MKey.
2 - Neither queries nor configuration changes are allowed, unless the
request contains a valid MKey.
3 - Identical to 2. Maintained for backwards compatibility.
MKey Lease Period
InfiniBand supports a MKey lease timeout, which is intended to allow
administrators or a new SM to recover/reset lost MKeys on a fabric.
If MKeys are enabled on the subnet and a switch or CA receives a
request that requires a valid MKey but does not contain one, it warns
the SM by sending a trap (Bad M_Key, Trap 256). If the MKey lease
period is non-zero, it also starts a countdown timer for the time spec‐
ified by the lease period. If a SM (or other agent) responds with the
correct MKey, the timer is stopped and reset. Should the timer reach
zero, the switch or CA will reset its MKey protection level to 0,
exposing the MKey and allowing recovery.
OpenSM will initialize all ports to use a mkey lease period of the num‐
ber of seconds specified in the config file. If no mkey_lease_period
is specified, a default of 0 will be used.
OpenSM normally quickly responds to all Bad_M_Key traps, resetting the
lease timers. Additionally, OpenSM's subnet sweeps will also cancel
any running timers. For maximum protection against accidentally-
exposed MKeys, the MKey lease time should be a few multiples of the
subnet sweep time. If OpenSM detects at startup that your sweep inter‐
val is greater than your MKey lease period, it will reset the lease
period to be greater than the sweep interval. Similarly, if sweeping
is disabled at startup, it will be re-enabled with an interval less
than the Mkey lease period.
If OpenSM is required to recover a subnet for which it is missing
MKeys, it must do so one switch level at a time. As such, the total
time to recover the subnet may be as long as the MKey lease period
multiplied by the maximum number of hops between the SM and an
endpoint, plus one.
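As a worked example of that bound (both input values below are
assumptions for illustration, not defaults):

```shell
# Worst-case recovery bound: lease_period * (max_hops + 1).
mkey_lease_period=60   # seconds (assumed config value)
max_hops=4             # max SM-to-endpoint hop count (assumed)
recovery=$(( mkey_lease_period * (max_hops + 1) ))
echo "worst-case recovery: ${recovery}s"   # prints 300s
```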
MKey Effects on Diagnostic Utilities
Setting an MKey may have a detrimental effect on diagnostic software
run on the subnet, unless your diagnostic software is able to retrieve
MKeys from the SA or can be explicitly configured with the proper MKey.
This is particularly true at protection level 2, where CAs will ignore
queries for management information that do not contain the proper MKey.
ROUTING
OpenSM now offers nine routing engines:
1. Min Hop Algorithm - based on the minimum hops to each node where
the path length is optimized.
2. UPDN Unicast routing algorithm - also based on the minimum hops to
each node, but it is constrained to ranking rules. This algorithm
should be chosen if the subnet is not a pure Fat Tree, and deadlock may
occur due to a loop in the subnet.
3. DNUP Unicast routing algorithm - similar to UPDN but allows routing
in fabrics which have some CA nodes attached closer to the roots than
some switch nodes.
4. Fat Tree Unicast routing algorithm - this algorithm optimizes rout-
ing for the congestion-free "shift" communication pattern. It should
be chosen if a subnet is a symmetrical or almost symmetrical fat-tree
of various types, not just K-ary-N-Trees: non-constant K, not fully
populated, any Constant Bisectional Bandwidth (CBB) ratio. Similar to
UPDN, Fat Tree routing is constrained to ranking rules.
5. LASH unicast routing algorithm - uses InfiniBand virtual layers (SL)
to provide deadlock-free shortest-path routing while also distributing
the paths between layers. LASH is an alternative deadlock-free,
topology-agnostic routing algorithm to the non-minimal UPDN algorithm,
avoiding the use of a potentially congested root node.
6. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
avoids port equalization except for redundant links between the same
two switches. This provides deadlock free routes for hypercubes when
the fabric is cabled as a hypercube and for meshes when cabled as a
mesh (see details below).
7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-
free routing while supporting two quality of service (QoS) levels. In
addition it is able to route around multiple failed fabric links or a
single failed fabric switch without introducing deadlocks, and without
changing path SL values granted before the failure.
8. DFSSSP unicast routing algorithm - a deadlock-free single-source-
shortest-path routing, which uses the SSSP algorithm (see algorithm 9)
as the base to optimize link utilization and uses InfiniBand virtual
lanes (SL) to provide deadlock-freedom.
9. SSSP unicast routing algorithm - a single-source-shortest-path rout‐
ing algorithm, which globally balances the number of routes per link to
optimize link utilization. This routing algorithm has no restrictions
in terms of the underlying topology.
OpenSM also supports a file method which can load routes from a table.
See 'Modular Routing Engine' for more information on this.
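The engine is selected with -R/--routing_engine; per the synopsis, more
than one engine name may be given. The commands below are only echoed,
not executed, and the comma-separated fallback form is an assumption
drawn from the '<engine name(s)>' synopsis:

```shell
# Illustrative invocations for selecting a routing engine.
echo "opensm -R minhop"        # explicitly request the default engine
echo "opensm -R updn,minhop"   # ordered list of engines (assumed form)
```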
The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation
How many hops are required to get from each port to each LID?
The algorithm to fill these tables differs between standard (min hop)
and Up/Down routing.
For standard routing, a "relaxation" algorithm is used to propagate the
min hop count from every destination LID through neighbor switches.
For Up/Down routing, a BFS from every target is used. The BFS tracks
link direction (up or down) and avoids steps that would go up after a
down step was used.
2. Once MinHop matrices exist, each switch is visited and for each tar‐
get LID a decision is made as to what port should be used to get to
that LID.
This step is common to standard and Up/Down routing. Each port has a
counter counting the number of target LIDs going through it.
When there are multiple alternative ports with the same MinHop to a
LID, the one with fewer previously assigned LIDs is selected.
If LMC > 0, more checks are added within each group of LIDs assigned
to the same target port:
a. use only ports which have the same MinHop
b. first prefer ports that go to a different systemImageGuid (than
the previous LID of the same LMC group)
c. if none - prefer those which go through another NodeGuid
d. fall back to the number-of-paths method (if all go to the same node).
Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no
change in the fabric switches unless the -r (--reassign_lids) option is
specified.
-r
--reassign_lids
This option causes OpenSM to reassign LIDs to all
end nodes. Specifying -r on a running subnet
may disrupt subnet traffic.
Without -r, OpenSM attempts to preserve existing
LID assignments, resolving multiple uses of the same LID.
If a link is added or removed, OpenSM does not recalculate the routes
that do not have to change. A route has to change if the port is no
longer UP or no longer the MinHop. When routing changes are performed,
the same algorithm for balancing the routes is invoked.
In the case of file-based routing, any topology changes are currently
ignored. The 'file' routing engine just loads the LFTs from the file
specified, with no reaction to the real topology. Obviously, it will
not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs
for non-existent switches will be skipped. Multicast is not affected
by the 'file' routing engine (it uses min hop tables).
Min Hop Algorithm
The Min Hop algorithm is invoked by default if no routing algorithm is
specified. It can also be invoked by specifying '-R minhop'.
The Min Hop algorithm is divided into two stages: computation of min-
hop tables on every switch and LFT output port assignment. Link sub‐
scription is also equalized with the ability to override based on port
GUID. The latter is supplied by:
-i <equalize-ignore-guids-file>
-ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports
(by guid) that will be ignored by the link load
equalization algorithm. Note that only endports (CA,
switch port 0, and router ports) and not switch external
ports are supported.
LMC awareness routes on a (remote) system or switch basis.
Purpose of UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in
loops of the subnet. A loop-deadlock is a situation in which it is no
longer possible to send data between any two hosts connected through
the loop. As such, the UPDN routing algorithm should be used if the
subnet is not a pure Fat Tree, and one of its loops may experience a
deadlock (due, for example, to high pressure).
The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes - based on the CA hop length from any switch
in the subnet, a statistical histogram is built for each switch (hop
num vs number of occurrences). If the histogram reflects a specific
column (higher than others) for a certain node, then it is marked as a
root node. Since the algorithm is statistical, it may not find any root
nodes. The list of the root nodes found by this auto-detect stage is
used by the ranking process stage.
Note 1: The user can override the node list manually.
Note 2: If this stage cannot find any root nodes, and the user did
not specify a guid list file, OpenSM defaults back to the
Min Hop routing algorithm.
2. Ranking process - All root switch nodes (found in stage 1) are
assigned a rank of 0. Using the BFS algorithm, the rest of the switch
nodes in the subnet are ranked incrementally. This ranking aids in the
process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting - after ranking is done, a BFS algorithm is
run from each (CA or switch) node in the subnet. During the BFS
process, the FDB table of each switch node traversed by BFS is updated,
in reference to the starting node, based on the ranking rules and guid
values.
At the end of the process, the updated FDB tables ensure loop-free
paths through the subnet.
Note: Up/Down routing does not allow LID routing communication between
switches that are located inside spine "switch systems". The reason is
that there is no way to allow a LID route between them that does not
break the Up/Down rule. One ramification of this is that you cannot
run an SM on switches other than the leaf switches of the fabric.
UPDN Algorithm Usage
Activation through OpenSM
Use '-R updn' option (instead of old '-u') to activate the UPDN algo‐
rithm. Use '-a <root_guid_file>' for adding an UPDN guid file that
contains the root nodes for ranking. If the `-a' option is not used,
OpenSM uses its auto-detect root nodes algorithm.
Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an
invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also
possible to specify CA guids; OpenSM will use the guid of the switch
(if it exists) that connects the CA to the subnet as a root node.
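For illustration, a minimal root guid file in the one-guid-per-line
format described above (the GUID values are invented), with the
matching UPDN invocation echoed rather than executed:

```shell
# Build a root guid file: one GUID per line, as described above.
roots=$(mktemp)
cat > "$roots" <<'EOF'
0x0008f10400411a08
0x0008f10400411a10
EOF
echo "opensm -R updn -a $roots"   # invocation shown, not executed
```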
Purpose of DNUP Algorithm
The DNUP algorithm is designed to serve a similar purpose to UPDN. How‐
ever it is intended to work in network topologies which are unsuited to
UPDN due to nodes being connected closer to the roots than some of the
switches. An example would be a fabric which contains nodes and
uplinks connected to the same switch. The operation of DNUP is the
same as UPDN, with the exception of the ranking process. In DNUP, all
switch nodes are ranked based solely on their distance from CA nodes:
all switch nodes directly connected to at least one CA are assigned a
rank of 1, and all other switch nodes are assigned a rank of one more
than the minimum rank of all neighboring switch nodes.
Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for the "shift" communication
pattern. It should be chosen if a subnet is a symmetrical or almost
symmetrical fat-tree of various types. It supports not only
K-ary-N-Trees: it handles non-constant K, cases where not all leaves
(CAs) are present, and any CBB ratio. As with UPDN, fat-tree routing
also prevents credit-loop deadlocks.
If the root guid file is not provided ('-a' or '--root_guid_file'
options), the topology has to be pure fat-tree that complies with the
following rules:
- Tree rank should be between two and eight (inclusively)
- Switches of the same rank should have the same number
of UP-going port groups*, unless they are root switches,
in which case they shouldn't have UP-going ports at all.
- Switches of the same rank should have the same number
of DOWN-going port groups, unless they are leaf switches.
- Switches of the same rank should have the same number
of ports in each UP-going port group.
- Switches of the same rank should have the same number
of ports in each DOWN-going port group.
- All the CAs have to be at the same tree level (rank).
If the root guid file is provided, the topology doesn't have to be pure
fat-tree, and it should only comply with the following rules:
- Tree rank should be between two and eight (inclusively)
- All the Compute Nodes** have to be at the same tree level (rank).
Note that non-compute node CAs are allowed here to be at different
tree ranks.
* ports that are connected to the same remote switch are referred to
as a 'port group'.
** the list of compute nodes (CNs) can be specified with the '-u' or
'--cn_guid_file' OpenSM options.
Topologies that do not comply cause a fallback to min hop routing.
Note that this can also occur on link failures which cause the topology
to no longer be "pure" fat-tree.
Note that although the fat-tree algorithm supports trees with a
non-integer CBB ratio, the routing will not be as balanced as with an
integer CBB ratio. In addition, although the algorithm allows leaf
switches to have any number of CAs, the closer the tree is to being
fully populated, the more effective the "shift" communication pattern
will be. In general, even if the root list is provided, the closer the
topology is to a pure and symmetrical fat-tree, the more optimal the
routing will be.
The algorithm also dumps a compute node ordering file (opensm-ftree-ca-
order.dump) in the same directory where the OpenSM log resides. This
ordering file provides the CN order that may be used to create an
efficient communication pattern that will match the routing tables.
Routing between non-CN nodes
The use of the cn_guid_file option allows non-CN nodes to be located
on different levels in the fat tree. In such a case, it is not
guaranteed that the fat-tree algorithm will route between two non-CN
nodes. To solve this problem, a list of non-CN nodes can be specified
with the '-G' or '--io_guid_file' option. These nodes will be allowed
to use switches the wrong way round a specific number of times
(specified by '-H' or '--max_reverse_hops'). With the proper
max_reverse_hops and io_guid_file values, you can ensure full
connectivity in the fat tree. Please note that using max_reverse_hops
creates routes that use the switch in a counter-stream way. This
option should never be used to connect nodes with high bandwidth
traffic between them! It should only be used to allow connectivity for
HA purposes or similar. Also, having routes the other way around can
in theory cause credit loops.
Use these options with extreme care!
Activation through OpenSM
Use the '-R ftree' option to activate the fat-tree algorithm. Use '-a
<root_guid_file>' to provide root nodes for ranking. If the '-a'
option is not used, the routing algorithm will detect roots
automatically. Use '-u <root_cn_file>' to provide the list of compute
nodes. If the '-u' option is not used, all the CAs are considered
compute nodes.
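Putting those options together (both guid files and their GUID values
are illustrative; the command is echoed, not executed):

```shell
# Fat-tree routing with explicit roots and compute nodes.
roots=$(mktemp); cns=$(mktemp)
echo "0x0008f10400411a08" > "$roots"   # root switch guid (made up)
echo "0x0002c90300001234" > "$cns"     # compute node CA guid (made up)
echo "opensm -R ftree -a $roots -u $cns"
```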
Note: LMC > 0 is not supported by fat-tree routing. If this is speci‐
fied, the default routing algorithm is invoked instead.
LASH Routing Algorithm
LASH is an acronym for LAyered SHortest Path Routing. It is a determin‐
istic shortest path routing algorithm that enables topology agnostic
deadlock-free routing within communication networks.
When computing the routing function, LASH analyzes the network topology
for the shortest-path routes between all pairs of sources / destina‐
tions and groups these paths into virtual layers in such a way as to
avoid deadlock.
Note LASH analyzes routes and ensures deadlock freedom between switch
pairs. The link between an HCA and a switch does not need virtual
layers, as deadlock will not arise between a switch and an HCA.
In more detail, the algorithm works as follows:
1) LASH determines the shortest-path between all pairs of source / des‐
tination switches. Note, LASH ensures the same SL is used for all
SRC/DST - DST/SRC pairs and there is no guarantee that the return path
for a given DST/SRC will be the reverse of the route SRC/DST.
2) LASH then begins an SL assignment process where a route is assigned
to a layer (SL) if the addition of that route does not cause deadlock
within that layer. This is achieved by maintaining and analysing a
channel dependency graph for each layer. Once the potential addition of
a path could lead to deadlock, LASH opens a new layer and continues the
process.
3) Once this stage has been completed, it is highly likely that the
first layers processed will contain more paths than the latter ones.
To better balance the use of layers, LASH moves paths from one layer to
another so that the number of paths in each layer averages out.
Note, the implementation of LASH in opensm attempts to use as few lay‐
ers as possible. This number can be less than the number of actual lay‐
ers available.
In general LASH is a very flexible algorithm. It can, for example,
reduce to Dimension Order Routing in certain topologies, it is topology
agnostic and fares well in the face of faults.
It has been shown that for both regular and irregular topologies, LASH
outperforms Up/Down. The reason for this is that LASH distributes the
traffic more evenly through a network, avoiding the bottleneck issues
related to a root node and always routes shortest-path.
The algorithm was developed by Simula Research Laboratory.
Use the '-R lash -Q' option to activate the LASH algorithm.
Note: QoS support has to be turned on in order that SL/VL mappings are
used.
Note: LMC > 0 is not supported by the LASH routing. If this is speci‐
fied, the default routing algorithm is invoked instead.
For open regular cartesian meshes the DOR algorithm is the ideal rout‐
ing algorithm. For toroidal meshes on the other hand there are routing
loops that can cause deadlocks. LASH can be used to route these cases.
The performance of LASH can be improved by preconditioning the mesh in
cases where there are multiple links connecting switches and also in
cases where the switches are not cabled consistently. An option
exists for LASH to do this. To invoke it, use '-R lash -Q
--do_mesh_analysis'. This adds an additional phase that analyzes the
mesh to try to determine the dimension and size of the mesh. If it
determines that the mesh looks like an open or closed Cartesian mesh,
it reorders the ports in dimension order before the rest of the LASH
algorithm runs.
DOR Routing Algorithm
The Dimension Order Routing algorithm is based on the Min Hop algorithm
and so uses shortest paths. Instead of spreading traffic out across
different paths with the same shortest distance, it chooses among the
available shortest paths based on an ordering of dimensions. Each port
must be consistently cabled to represent a hypercube dimension or a
mesh dimension. Alternatively, the -O option can be used to assign a
custom mapping between the ports on a given switch, and the associated
dimension. Paths are grown from a destination back to a source using
the lowest dimension (port) of available paths at each step. This pro‐
vides the ordering necessary to avoid deadlock. When there are multi‐
ple links between any two switches, they still represent only one
dimension and traffic is balanced across them unless port equalization
is turned off. In the case of hypercubes, the same port must be used
throughout the fabric to represent the hypercube dimension and match on
both ends of the cable, or the -O option used to accomplish the align‐
ment. In the case of meshes, the dimension should consistently use the
same pair of ports, one port on one end of the cable, and the other
port on the other end, continuing along the mesh dimension, or the -O
option used as an override.
Use '-R dor' option to activate the DOR algorithm.
DFSSSP and SSSP Routing Algorithm
The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is
designed to optimize link utilization through global balancing of
routes, while supporting arbitrary topologies. The DFSSSP routing
algorithm uses InfiniBand virtual lanes (SL) to provide
deadlock-freedom.
The DFSSSP algorithm consists of five major steps:
1) It discovers the subnet and models the subnet as a directed multi‐
graph in which each node represents a node of the physical network and
each edge represents one direction of the full-duplex links used to
connect the nodes.
2) A loop, which iterates over all CAs and switches of the subnet,
performs three steps to generate the linear forwarding tables for each
switch:
2.1) use Dijkstra's algorithm to find the shortest path from all nodes
to the currently selected destination;
2.2) update the edge weights in the graph, i.e. add the number of
routes which use a link to reach the destination to the link/edge;
2.3) update the LFT of each switch with the outgoing port which was
used in the current step to route the traffic to the destination node.
3) After the number of available virtual lanes or layers in the subnet
is detected and a channel dependency graph is initialized for each
layer, the algorithm will put each possible route of the subnet into
the first layer.
4) A loop iterates over all channel dependency graphs (CDG) and per‐
forms the following substeps:
4.1) search for a cycle in the current CDG;
4.2) when a cycle is found, i.e. a possible deadlock is present, one
edge is selected and all routes which induced this edge are moved to
the "next higher" virtual layer (CDG[i+1]);
4.3) the cycle search is continued until all cycles are broken and
routes are moved "up".
5) When the number of needed layers does not exceed the number of
available SLs/VLs to remove all cycles in all CDGs, the routing is
deadlock-free and a relation table is generated, which contains the
assignment of routes from source to destination to an SL.
Note on SSSP:
This algorithm does not perform steps 3)-5) and cannot be considered
deadlock-free for all topologies. On the one hand, you can choose this
algorithm for really large networks (5,000+ CAs, or topologies that
are deadlock-free by design) to reduce the runtime of the algorithm.
On the other hand, you might use the SSSP routing algorithm as an
alternative when all deadlock-free routing algorithms fail to route
the network for whatever reason. In the latter case, SSSP was designed
to deliver equal or higher bandwidth due to better congestion
avoidance than the Min Hop routing algorithm.
Notes for usage:
a) running DFSSSP: '-R dfsssp -Q'
a.1) QoS has to be configured to equally spread the load on the
available SLs or virtual lanes
a.2) applications must perform a path record query to get the path SL
for each route, which the application will use to transmit packets
b) running SSSP: '-R sssp'
c) both algorithms support LMC > 0
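The two activation forms above as concrete commands, echoed rather
than executed:

```shell
echo "opensm -R dfsssp -Q"   # deadlock-free variant; QoS must be on
echo "opensm -R sssp"        # plain SSSP; no SL/VL requirement
```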
Torus-2QoS Routing Algorithm
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus
fabrics; see torus-2QoS(8) for full documentation.
Use '-R torus-2QoS -Q' or '-R torus-2QoS,no_fallback -Q' to activate
the torus-2QoS algorithm.
Routing References
To learn more about deadlock-free routing, see the article "Deadlock
Free Message Routing in Multiprocessor Interconnection Networks" by
William J Dally and Charles L Seitz (1985).
To learn more about the up/down algorithm, see the article "Effective
Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose
Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad
Politecnica de Valencia.
To learn more about LASH and the flexibility behind it, the requirement
for layers, performance comparisons to other algorithms, see the fol‐
lowing articles:
"Layered Routing in Irregular Networks", Lysne et al., IEEE
Transactions on Parallel and Distributed Systems, Vol. 16, No. 12,
December 2005.
"Routing for the ASI Fabric Manager", Solheim et al., IEEE
Communications Magazine, Vol. 44, No. 7, July 2006.
"Layered Shortest Path (LASH) Routing in Irregular System Area
Networks", Skeie et al., IEEE Computer Society Communication
Architecture for Clusters 2002.
To learn more about the DFSSSP and SSSP routing algorithm, see the
articles:
J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing for
Arbitrary Topologies, In Proceedings of the 25th IEEE International
Parallel & Distributed Processing Symposium (IPDPS 2011)
T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for Large-
Scale InfiniBand Networks, In 17th Annual IEEE Symposium on High Per‐
formance Interconnects (HOTI 2009)
Modular Routing Engine
The modular routing engine structure allows new routing modules to be
easily "plugged in".
Currently, only unicast callbacks are supported. Multicast can be added
later.
One existing routing module is up-down "updn", which may be activated
with '-R updn' option (instead of old '-u').
General usage is: $ opensm -R 'module-name'
There is also a trivial routing module which is able to load LFT tables
from a file.
Main features:
- this will load switch LFTs and/or LID matrices (min hops tables)
- this will load switch LFTs according to the path entries introduced
in the file
- no additional checks will be performed (such as "is port connected",
etc.)
- in case fabric LIDs were changed, this will try to reconstruct LFTs
correctly if endport GUIDs are represented in the file
(to disable this, GUIDs may be removed from the file or zeroed)
The file format is compatible with the output of the 'ibroute' utility
and, for the whole fabric, can be generated with the dump_lfts.sh
script.
To activate file based routing module, use:
opensm -R file -U /path/to/lfts_file
If the lfts_file is not found or is in error, the default routing algo‐
rithm is utilized.
The ability to dump switch lid matrices (aka min hops tables) to file
and later to load these is also supported.
The usage is similar to loading unicast forwarding tables from an lfts
file (introduced by the 'file' routing engine), but the lid matrix
file name should be specified with the -M or --lid_matrix_file option.
For example:
opensm -R file -M ./opensm-lid-matrix.dump
The dump file is named 'opensm-lid-matrix.dump' and will be generated
in the standard opensm dump directory (/var/log by default) when the
OSM_LOG_ROUTING logging flag is set.
When routing engine 'file' is activated, but the lfts file is not
specified or cannot be opened, the default lid matrix algorithm will
be used.
There is also a switch forwarding tables dumper which generates a file
compatible with dump_lfts.sh output. This file can be used as input
for forwarding tables loading by the 'file' routing engine. Both or
one of the options -U and -M can be specified together with '-R file'.
PER MODULE LOGGING CONFIGURATION
To enable per module logging, set per_module_logging to TRUE in the
opensm options file and configure per_module_logging_file there
appropriately.
The per module logging config file format is a set of lines with module
name and logging level as follows:
<module name><separator><logging level>
<module name> is the file name including .c
<separator> is either = , space, or tab
<logging level> is the same levels as used in the coarse/overall
logging as follows:
BIT LOG LEVEL ENABLED
---------------------
0x01 - ERROR (error messages)
0x02 - INFO (basic messages, low volume)
0x04 - VERBOSE (interesting stuff, moderate volume)
0x08 - DEBUG (diagnostic, high volume)
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - SYS (syslog at LOG_INFO level in addition to OpenSM logging)
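A minimal per module logging config built from the format above; the
module name is an illustrative example of an OpenSM source file name
(verify against your installed sources), and 0x07 combines
ERROR|INFO|VERBOSE (0x01|0x02|0x04):

```shell
# One line per module: <module name><separator><logging level>.
pml=$(mktemp)
echo "osm_ucast_mgr.c = 0x07" > "$pml"   # ERROR|INFO|VERBOSE
cat "$pml"
```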
FILES
/etc/rdma/opensm.conf
default OpenSM config file.
/etc/rdma/ib-node-name-map
default node name map file. See ibnetdiscover for more informa‐
tion on format.
/etc/rdma/partitions.conf
default partition config file
/etc/rdma/qos-policy.conf
default QOS policy config file
/etc/rdma/prefix-routes.conf
default prefix routes file
/etc/rdma/per-module-logging.conf
default per module logging config file
/etc/rdma/torus-2QoS.conf
default torus-2QoS config file
AUTHORS
Hal Rosenstock
<hal@mellanox.com>
Sasha Khapyorsky
<sashak@voltaire.com>
Eitan Zahavi
<eitan@mellanox.co.il>
Yevgeny Kliteynik
<kliteyn@mellanox.co.il>
Thomas Sodring
<tsodring@simula.no>
Ira Weiny
<weiny2@llnl.gov>
Dale Purdy
<purdy@sgi.com>
SEE ALSO
torus-2QoS(8), torus-2QoS.conf(5).
OpenIB March 8, 2012 OPENSM(8)