Apache Ozone / HDDS-3586

OM HA can be started with 3 isolated LEADERs instead of one OM ring


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed

    Description

      Steps to reproduce:

      Assume three different OMs with the following DNS names:

      ozone-om-0.ozone-om
      ozone-om-1.ozone-om
      ozone-om-2.ozone-om
      

      I configured the three hosts as follows:

        OZONE-SITE.XML_ozone.om.nodes.omservice: om1,om2,om3
        OZONE-SITE.XML_ozone.om.address.omservice.om1: ozone-om-0
        OZONE-SITE.XML_ozone.om.address.omservice.om2: ozone-om-1
        OZONE-SITE.XML_ozone.om.address.omservice.om3: ozone-om-2
        OZONE-SITE.XML_ozone.om.ratis.enable: "true"
      

      Unfortunately, DNS is not reliable: each host can resolve only its own (LOCAL) hostname.

      OMHANodeDetails.java ignores ALL configured addresses that cannot be resolved:

       if (!addr.isUnresolved()) {
         if (!isPeer && OmUtils.isAddressLocal(addr)) {
           localRpcAddress = addr;
           localOMServiceId = serviceId;
           localOMNodeId = nodeId;
           localRatisPort = ratisPort;
           found++;
         } else {
           // This OMNode belongs to same OM service as the current OMNode.
           // Add it to peerNodes list.
           peerNodesList.add(getHAOMNodeDetails(conf, serviceId,
               nodeId, addr, ratisPort));
         }
       }
      

      As a result, there are 3 running servers, but each forms its own one-node Ratis ring (peerNodesList is empty because only the local hostname can be resolved).
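
      The effect can be reproduced outside Ozone with a minimal sketch. The class and method names below are hypothetical, the port is a placeholder, and the local-node check is simplified to a node ID comparison (the real code calls OmUtils.isAddressLocal). InetSocketAddress.createUnresolved() stands in for hostnames that DNS cannot resolve.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of the peer filtering; not Ozone code. */
final class PeerFilterSketch {

  /**
   * Same shape as the excerpt above: only resolved addresses are
   * considered, so unresolved peers disappear without any error.
   */
  static List<String> resolvedPeers(Map<String, InetSocketAddress> nodes,
                                    String localNodeId) {
    List<String> peers = new ArrayList<>();
    for (Map.Entry<String, InetSocketAddress> e : nodes.entrySet()) {
      InetSocketAddress addr = e.getValue();
      if (!addr.isUnresolved()) {
        if (!e.getKey().equals(localNodeId)) {
          peers.add(e.getKey());
        }
      }
      // No else branch: an unresolved address is silently skipped.
    }
    return peers;
  }

  public static void main(String[] args) {
    Map<String, InetSocketAddress> nodes = new LinkedHashMap<>();
    // The local node resolves (IP literal, so no DNS lookup is needed).
    nodes.put("om1", new InetSocketAddress("127.0.0.1", 9862));
    // Simulate peers whose hostnames cannot be resolved on this host.
    nodes.put("om2", InetSocketAddress.createUnresolved("ozone-om-1", 9862));
    nodes.put("om3", InetSocketAddress.createUnresolved("ozone-om-2", 9862));

    List<String> peers = resolvedPeers(nodes, "om1");
    // prints "configured=3 peers=[]"
    System.out.println("configured=" + nodes.size() + " peers=" + peers);
  }
}
```

      Three nodes are configured, yet the peer list is empty and nothing signals the mismatch.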

      The group ID is the same for all three, but each has a separate database and they operate as separate OMs, which is VERY dangerous.
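
      One plausible reason the group ID matches on every node: if it is derived deterministically from the shared service ID, each isolated OM computes the same value on its own. The sketch below assumes a name-based UUID derivation; the helper name is hypothetical and the derivation is an assumption, not something this report confirms.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical illustration; the actual derivation in Ozone may differ. */
final class GroupIdSketch {

  /** Assumed derivation: name-based (type 3) UUID of the service ID. */
  static UUID groupIdFor(String serviceId) {
    return UUID.nameUUIDFromBytes(serviceId.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    // Each OM reads the same service ID from its own configuration.
    UUID onFirstOm = groupIdFor("omservice");
    UUID onSecondOm = groupIdFor("omservice");
    System.out.println(onFirstOm.equals(onSecondOm)); // prints "true"
  }
}
```

      Under that assumption, an identical group ID says nothing about whether the nodes actually joined the same ring.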

      1. Option one: accept any unresolved address and retry when creating the connection if the peer cannot be reached.

      2. Option two: at the very least, the error handling should be fixed. When 3 OMs are configured, there are supposed to be 3 OMs.
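
      Both options can be sketched with plain java.net classes. This is an illustration under stated assumptions, not the fix that was applied: the class and method names are hypothetical, and a real implementation would work on the OM node details rather than on bare socket addresses.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helpers illustrating the two options; not Ozone code. */
final class PeerResolutionSketch {

  /** Option two: refuse to start if any configured address is unresolved. */
  static void requireAllResolved(List<InetSocketAddress> configured) {
    List<String> unresolved = new ArrayList<>();
    for (InetSocketAddress addr : configured) {
      if (addr.isUnresolved()) {
        unresolved.add(addr.getHostString());
      }
    }
    if (!unresolved.isEmpty()) {
      throw new IllegalStateException(
          "Cannot resolve configured OM address(es): " + unresolved);
    }
  }

  /** Option one: keep the peer and try to resolve it again at connect time. */
  static InetSocketAddress reResolve(InetSocketAddress addr) {
    if (!addr.isUnresolved()) {
      return addr;
    }
    // Constructing a new InetSocketAddress triggers a fresh lookup; the
    // result is still marked unresolved if the lookup fails again.
    return new InetSocketAddress(addr.getHostString(), addr.getPort());
  }

  public static void main(String[] args) {
    List<InetSocketAddress> configured = Arrays.asList(
        new InetSocketAddress("127.0.0.1", 9862),
        InetSocketAddress.createUnresolved("ozone-om-1", 9862),
        InetSocketAddress.createUnresolved("ozone-om-2", 9862));
    try {
      requireAllResolved(configured);
    } catch (IllegalStateException e) {
      // Startup fails loudly instead of forming a one-node ring.
      System.out.println(e.getMessage());
    }
  }
}
```

      With option one, the caller would invoke reResolve() before each connection attempt; with option two, a misconfigured or unreachable DNS stops the OM before it can elect itself leader.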

            People

              Assignee: Hanisha Koneru (hanishakoneru)
              Reporter: Marton Elek (elek)