1. Mellanox FCA 3.0


1.1 Overview

1.2 System Requirements

    - Mellanox OFED 2.1-0.0.X
    - Open MPI 1.7.4 or later

2. Configuring FCA 3.0

2.1 Compiling Open MPI with FCA 3.0

    1. Install FCA 3.0 from an RPM
        % rpm -ihv hcoll-x.y.z-1.x86_64.rpm

        FCA 3.0 will be installed automatically in the /opt/mellanox/hcoll
        folder.

    2. Enter the Open MPI source directory and run the following command:
        % cd $OMPI_HOME
        % ./configure --with-hcoll=/opt/mellanox/hcoll < ... other configure
         parameters>
        % make -j 9 && make install -j 9

    MLNX_OFED comes with a preinstalled version of FCA 3.0:

    - MLNX_OFED 2.XXX comes with FCA 3.X and Open MPI 1.7.X compiled with FCA
      3.x

    To check the version of FCA installed on your host, run:
        % rpm -qi hcoll

    To upgrade to a newer version of FCA 3.X:

    1. Remove the existing FCA 3.X:
        % rpm -e hcoll
    2. Remove the precompiled Open MPI:
        % rpm -e mlnx-openmpi_gcc
    3. Install the new FCA 3.X and compile Open MPI against it.

2.2 Enabling FCA in Open MPI

    To enable the use of the FCA 3.0 HCOLL collectives in Open MPI, you must
    request them explicitly by setting the following MCA parameter:
    "-mca coll hcoll,tuned,libnbc".

2.3 Tuning FCA 3.0 settings

The default FCA 3.0 settings should be optimal for most systems. However, to
check the available FCA 3.0 parameters and their default values, run the
following command:

    % hcoll_info --param <framework_name> <component_name>

    <framework_name> is one of the following frameworks: coll, bcol, sbgp,
      ofacm_rte, hcoll_mpool, netpatterns, hcoll_rcache
    <component_name> is specific to a framework. Examples are: ml, ptpcoll,
      ucx_p2p, basesmuma, p2p, base ...

You can also specify "all" instead of each of these parameters, for example:

    % hcoll_info --param all all

FCA 3.0 parameters are simply environment variables and can be modified in one of two ways:

   - Setting an FCA 3.0 parameter as part of the mpirun command:
     % mpirun ... -x HCOLL_ML_BUFFER_SIZE=65536
   - Exporting an FCA 3.0 parameter from the shell before launching:
     % export HCOLL_ML_BUFFER_SIZE=65536
     % mpirun ...

2.4 Selecting ports and devices

    Select the HCA device and port you would like FCA 3.0 to run over by
    setting:

        -x HCOLL_MAIN_IB=<device_name>:<port_num>, e.g. mlx5_0:1

3. Runtime configuration of FCA

3.1 Taking advantage of hierarchy

By design, FCA 3.0 is flexible and modular, giving the user a wide degree of
latitude to customize collective algorithms to take full advantage of their
Mellanox hardware at application runtime.

The FCA 3.0 software model abstracts the notion of a memory hierarchy into
subgrouping, or SBGP, components. An SBGP group is a subset of endpoints that
satisfy a reachability criterion. Associated with each SBGP is a set of
optimized collective primitives: basic collective, or BCOL, components.

3.1.1 Available SBGPs

    - basesmuma:     subset of ranks that share the same host.
    - basesmsocket:  subset of ranks that share the same socket.
    - ibnet:         subset of ranks that can communicate with CORE-Direct.
    - p2p:           subset of ranks that can reach each other over point-to-point.

3.1.2 Available BCOLs

    - basesmuma:     shared memory primitives.
    - ucx_p2p:       UCX based point-to-point primitives.
    - ptpcoll:       Point-to-point logical layer.
    - cc:            Mellanox Cross-Channel offloads.

3.1.3 Putting the pieces together

    - Two-level hierarchy with CORE-Direct used at the "top" level:

        -x HCOLL_BCOL=basesmuma,iboffload,ucx_p2p -x HCOLL_SBGP=basesmuma,ibnet,p2p

    - Three-level hierarchy with CORE-Direct used at the "top" level:

        -x HCOLL_BCOL=basesmuma,basesmuma,iboffload,ucx_p2p -x HCOLL_SBGP=basesmsocket,basesmuma,ibnet,p2p

    - Two-level hierarchy with UCX p2p used at the "top" level:

        -x HCOLL_BCOL=basesmuma,ucx_p2p -x HCOLL_SBGP=basesmuma,p2p

    - Three-level hierarchy with UCX p2p used at the "top" level:

        -x HCOLL_BCOL=basesmuma,basesmuma,ucx_p2p -x HCOLL_SBGP=basesmsocket,basesmuma,p2p
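In each example above, HCOLL_BCOL and HCOLL_SBGP contain the same number of
entries, one BCOL per subgroup level. A small shell sketch (not part of FCA;
the values are copied from the two-level example above) can catch mismatched
lists before a launch:

```shell
# Sketch: verify that HCOLL_BCOL and HCOLL_SBGP define the same number
# of hierarchy levels before passing them to mpirun.
BCOL="basesmuma,ucx_p2p"   # values from the two-level example above
SBGP="basesmuma,p2p"
nb=$(printf '%s\n' "$BCOL" | awk -F, '{print NF}')
ns=$(printf '%s\n' "$SBGP" | awk -F, '{print NF}')
if [ "$nb" -eq "$ns" ]; then
    echo "OK: $nb hierarchy levels"
else
    echo "mismatch: $nb BCOLs vs $ns SBGPs" >&2
fi
```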

3.1.4 Profiles for using FCA 3.0 in Open MPI

Open MPI 1.8 and above provides a mechanism to set the best configuration
options on a per-application/scale basis. To use this feature, simply set the
corresponding MCA parameter "mca_base_env_list" to a semicolon-separated list
of environment variables in the following format: VAR1=VAL1;VAR2=VAL2;VAR3.
If no value is specified for a variable (VAR3 above), it is taken from the
current environment.

These are two examples based on different launching methods:

    % mpirun -n 4 -mca mca_base_env_list "HCOLL_BCOL=ptpcoll;HCOLL_SBGP=p2p" ...
    % env OMPI_MCA_mca_base_env_list="HCOLL_BCOL=ptpcoll;HCOLL_SBGP=p2p" srun --mpi=pmi2 -N2 -n4 ...
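The fallback-to-environment behavior can be sketched in plain shell; this only
imitates what Open MPI does with the list and is not itself part of FCA
(VAR1/VAR2/VAR3 are placeholder names):

```shell
# Sketch of mca_base_env_list semantics: entries containing '=' carry an
# explicit value; a bare name (VAR3) falls back to the current environment.
export VAR3=from_env
LIST="VAR1=a;VAR2=b;VAR3"
(
  IFS=';'
  for entry in $LIST; do
      case "$entry" in
          *=*) echo "$entry" ;;                        # explicit value
          *)   echo "$entry=$(printenv "$entry")" ;;   # inherited value
      esac
  done
)
# prints: VAR1=a, VAR2=b, VAR3=from_env
```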

FCA 3.0 provides multiple profiles covering the most commonly used flows. Examples:

    - hcoll_mxm_only.conf - to use a one-level hierarchy using MXM p2p.
    - hcoll_2level_uma_cd_mxm.conf - to use two-level hierarchy with CORE-Direct used at the "top" level.
    - hcoll_1level_mcast.conf - to enable multicast based collectives.
    - hcoll_2level_mcast.conf - to use two-level hierarchy with multicast at the "top" level and
      shared memory at the second level.

To use a specific profile, run the following command:

    % mpirun ... -am <profile.conf> -mca mca_base_param_file_path /opt/mellanox/hcoll/etc ...
    or
    % mpirun ... -mca mca_base_param_files /opt/mellanox/hcoll/etc/<profile.conf> ...

    and when using srun as the launching method:
    % env OMPI_MCA_mca_base_param_files=/opt/mellanox/hcoll/etc/<profile.conf> srun --mpi=pmi2 -N2 -n4 ...

3.2 Some notes on CORE-Direct

To meet the needs of scientific research and engineering simulations,
supercomputers are growing at an unrelenting rate. As supercomputers
increase in size from mere thousands to hundreds-of-thousands of
processor cores, new performance and scalability challenges have
emerged. In the past, performance tuning of parallel applications could
be accomplished fairly easily by separately optimizing their algorithms,
communication, and computational aspects. However, as we continue to scale to
larger machines, these issues become co-mingled and must be addressed
comprehensively.

Collective communications execute global communication operations that couple
all processes/nodes in the system and therefore must be executed as quickly and as
efficiently as possible. Most current implementations of collective operations
suffer from the effects of system noise at extreme scale (the amplification of
OS interrupts, randomly spread over all hosts, during collective progression)
and consume a significant fraction of CPU time and energy, time and energy
that could be better spent doing meaningful computation.

Mellanox Technologies has addressed these extreme-scale collective communication
scalability problems by offloading the communications to the host channel adapters
(HCAs) and switches. The technology, named CORE-Direct (Collectives Offload
Resource Engine), provides the most advanced solution for handling collective
operations, ensures maximum scalability, minimizes CPU overhead, and provides
the capability to overlap communication with computation.

We are pleased to expose the power of CORE-Direct for the first time in a
commercially available software package in FCA 3.0. Users may benefit
immediately from CORE-Direct out-of-the-box by simply specifying the necessary
BCOL/SBGP combinations shown in section 3.1.3. To take maximum advantage of
CORE-Direct, users may modify their applications to use MPI 3.0 non-blocking
routines while using CORE-Direct to offload the collective "under the covers",
thereby allowing maximum opportunity to overlap communication with
computation.

3.4 Enabling Mellanox specific features and optimizations

    - Multicast acceleration: Like previous versions of FCA, FCA 3.0 uses
      hardware multicast to accelerate collective primitives in the "ucx_p2p"
      BCOL when possible. It is enabled by default.

      -x HCOLL_MCAST_ENABLE_ALL=1

    - SHArP (Scalable Hierarchical Aggregation Protocol):
      SHArP enables offloading of data aggregation to the network to accelerate
      collective primitives in the "ucx_p2p" BCOL. It is currently enabled for
      the barrier and allreduce collective primitives.

      -x HCOLL_ENABLE_SHARP=1

      Please refer to the SHArP deployment guide for more details on
      installation and usage.


    - Context caching: When using either of the two Mellanox-specific BCOLs
      (ucx_p2p or iboffload), you may enable context caching. This
      optimization can benefit applications that create and destroy many
      MPI communicators. It is enabled by default.

      -x HCOLL_CONTEXT_CACHE_ENABLE=1

3.5 An example command line

Running the IMB benchmark on 1,024 MPI processes with two levels of hierarchy,
shared memory and UCX point-to-point, with both context caching and multicast
acceleration enabled:

    % mpirun -np 1024 --bind-to-core -bynode -mca btl_openib_if_include mlx4_0:1
      -mca coll hcoll,tuned,libnbc -mca btl sm,openib,self
      -x HCOLL_MCAST_ENABLE_ALL=1
      -x HCOLL_IB_IF_INCLUDE=mlx4_0:1 -x HCOLL_BCOL=basesmuma,ucx_p2p
      -x HCOLL_SBGP=basesmuma,p2p ~/IMB/src/IMB-MPI1 -exclude PingPong PingPing Sendrecv

3.6 Enabling verbose log messages

To see verbose log messages, set the root verbose level:

        -x HCOLL_VERBOSE=<level>

and/or set the verbose level for a specific category:

    - BCOL-BASESMUMA BCOL category verbose messages:

        -x HCOLL_BCOL_BASESMUMA_VERBOSE=<level>

    - IBOFFLOAD BCOL category verbose messages:

        -x HCOLL_BCOL_IBOFFLOAD_VERBOSE=<level>

    - PTPCOLL and MLNXP2P BCOLs categories verbose messages:

        -x HCOLL_BCOL_P2P_VERBOSE=<level>

    - COLL-ML category verbose messages:

        -x HCOLL_ML_VERBOSE=<level>

    - NETPATTERNS category verbose messages:

        -x HCOLL_NETPATTERNS_BASE_VERBOSE=<level>

    - OFACMRTE category verbose messages:

        -x HCOLL_OFACM_VERBOSE=<level>

    - BASESMSOCKET SBGP category verbose messages:

        -x HCOLL_SBGP_BASESMSOCKET_VERBOSE=<level>

    - IBNET SBGP category verbose messages:

        -x HCOLL_SBGP_IBNET_VERBOSE=<level>

where <level> is an integer with default value 0 (meaning no verbose output).
Higher verbosity levels produce more output and diagnostics.

When HCOLL_VERBOSE is used, all unspecified category verbosity levels are set
equal to its value.
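This lookup order, per-category variable first, then HCOLL_VERBOSE, then the
default of 0, can be illustrated with a small shell sketch (the `level` helper
below is ours, not an FCA tool):

```shell
# Sketch: a category's effective verbosity is its own variable if set,
# otherwise HCOLL_VERBOSE, otherwise the default of 0.
export HCOLL_VERBOSE=3
export HCOLL_ML_VERBOSE=9
level() { printenv "$1" || printenv HCOLL_VERBOSE || echo 0; }
level HCOLL_ML_VERBOSE         # prints 9 (explicitly set)
level HCOLL_BCOL_P2P_VERBOSE   # prints 3 (inherited from HCOLL_VERBOSE)
```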

4. FCA 3.0 Integration

In principle, FCA 3.0 can be integrated into any communication library. To do
so, one must first implement the so-called RTE interface, a collection of
callbacks and handles that must subsequently be passed to FCA. The best
example is provided in the Open MPI source code: the "hcoll" component
contained in the OMPI "coll" framework is the runtime integration layer of FCA
into OMPI, and a complete implementation of the RTE is provided there. For
those who wish to study a standalone example, one can be found in
/opt/mellanox/hcoll/sdk. Please refer to the SDK's README for instructions on
compiling and running; the RTE implementation can be found in the file
"hcoll_sdk.c". Implementers should allow 4-6 weeks to complete the task of
FCA 3.0 integration.


