Apache Storm / STORM-1043

Concurrent access to state on local FS by multiple supervisors

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 0.9.5
    • Fix Version/s: None
    • Component/s: storm-core

    Description

      Hi,

      We are running a storm-mesos cluster, and occasionally workers die or are "lost" in Mesos. When this happens, it often coincides with errors in the logs related to the supervisors' local state.

      Looking at the Storm code, it seems this might be caused by the way multiple supervisor processes access the local state in the same directory via VersionedStore.

      For example: https://github.com/apache/storm/blob/master/storm-core/src/clj/backtype/storm/daemon/supervisor.clj#L434

      Here every supervisor does this concurrently:
      1. reads latest state from FS
      2. possibly updates the state
      3. writes the new version of the state

      Some updates could be lost if two or more supervisors execute the above steps concurrently: only the updates from the last supervisor to write would remain in the latest state version on disk.
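
The three steps above can be reproduced deterministically with a small sketch. This is not Storm's code: the JSON files, the helper names `read_latest` and `write_version`, and the keys `supervisor-a` and `supervisor-b` are all hypothetical, chosen only to show how a read-modify-write cycle over full snapshots can drop an update when two processes interleave.

```python
import json
import tempfile
from pathlib import Path


def read_latest(state_dir: Path) -> dict:
    """Step 1: load the highest-numbered version file, or {} if none exists."""
    versions = sorted(state_dir.glob("*.json"), key=lambda p: int(p.stem))
    return json.loads(versions[-1].read_text()) if versions else {}


def write_version(state_dir: Path, version: int, state: dict) -> None:
    """Step 3: persist the whole state map as a new version file."""
    (state_dir / f"{version}.json").write_text(json.dumps(state))


def demo_lost_update() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        state_dir = Path(tmp)
        write_version(state_dir, 1, {"base": "initial"})

        # Step 1: both (hypothetical) supervisors read the same latest version.
        state_a = read_latest(state_dir)
        state_b = read_latest(state_dir)

        # Step 2: each applies its own update to its in-memory copy.
        state_a["supervisor-a"] = "worker assigned"
        state_b["supervisor-b"] = "worker assigned"

        # Step 3: each writes a full snapshot as a new version.
        write_version(state_dir, 2, state_a)
        write_version(state_dir, 3, state_b)

        # The latest version contains only B's update; A's is gone.
        return read_latest(state_dir)


final_state = demo_lost_update()
print(final_state)
```

The printed state contains `base` and `supervisor-b` but not `supervisor-a`, because the second writer's snapshot was built from a copy read before the first writer's update existed.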

      We observed the local state changing quite often (on the order of seconds), so the likelihood of this concurrency issue occurring is high.

      Some examples of exceptions:
      ------------------------------------------
      java.lang.RuntimeException: Version already exists or data already exists
      at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:85) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.utils.VersionedStore.createVersion(VersionedStore.java:79) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.utils.LocalState.persist(LocalState.java:101) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.utils.LocalState.put(LocalState.java:82) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.utils.LocalState.put(LocalState.java:76) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.daemon.supervisor$mk_synchronize_supervisor$this7400.invoke(supervisor.clj:382) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
      at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
      at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]

      ---------------------------------------
      java.io.FileNotFoundException: File '/var/lib/storm/supervisor/localstate/1441034838231' does not exist
      at org.apache.commons.io.FileUtils.openInputStream(FileUtils.java:299) ~[commons-io-2.4.jar:2.4]
      at org.apache.commons.io.FileUtils.readFileToByteArray(FileUtils.java:1763) ~[commons-io-2.4.jar:2.4]
      at backtype.storm.utils.LocalState.deserializeLatestVersion(LocalState.java:61) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.utils.LocalState.snapshot(LocalState.java:47) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.utils.LocalState.get(LocalState.java:72) ~[storm-core-0.9.5.jar:0.9.5]
      at backtype.storm.daemon.supervisor$sync_processes.invoke(supervisor.clj:234) ~[storm-core-0.9.5.jar:0.9.5]
      at clojure.lang.AFn.applyToHelper(AFn.java:161) [clojure-1.5.1.jar:na]
      at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.5.1.jar:na]
      at clojure.core$apply.invoke(core.clj:619) ~[clojure-1.5.1.jar:na]
      at clojure.core$partial$fn4190.doInvoke(core.clj:2396) ~[clojure-1.5.1.jar:na]
      at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.5.1.jar:na]
      at backtype.storm.event$event_manager$fn2625.invoke(event.clj:40) ~[storm-core-0.9.5.jar:0.9.5]
      at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
      at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
      -----------------------------------------
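
Both kinds of failure above could plausibly arise from uncoordinated processes sharing one versioned directory. The sketch below uses a toy class, `ToyVersionedStore`, which is a hypothetical stand-in and not Storm's VersionedStore. It assumes that version ids are derived from a clock, so two writers can pick the same id, and that old versions are pruned after a write. Both are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path


class ToyVersionedStore:
    """Hypothetical versioned directory; NOT Storm's VersionedStore."""

    def __init__(self, root: Path):
        self.root = root

    def create_version(self, version: int, data: bytes) -> Path:
        path = self.root / str(version)
        if path.exists():
            raise RuntimeError("Version already exists or data already exists")
        path.write_bytes(data)
        return path

    def latest_path(self) -> Path:
        return max(self.root.iterdir(), key=lambda p: int(p.name))

    def cleanup(self, keep: int) -> None:
        # Delete everything except the `keep` newest versions.
        versions = sorted(self.root.iterdir(), key=lambda p: int(p.name))
        for old in versions[: len(versions) - keep]:
            old.unlink()


def demo_failures() -> list:
    errors = []
    with tempfile.TemporaryDirectory() as tmp:
        a = ToyVersionedStore(Path(tmp))  # handle held by process A
        b = ToyVersionedStore(Path(tmp))  # handle held by process B, same directory

        # Assumption: both writers derive the same version id from the clock.
        a.create_version(1000, b"state-a")
        try:
            b.create_version(1000, b"state-b")
        except RuntimeError as exc:
            errors.append(type(exc).__name__)

        # A resolves the latest path; B then writes and prunes before A reads.
        stale = a.latest_path()
        b.create_version(1001, b"state-b")
        b.cleanup(keep=1)
        try:
            stale.read_bytes()
        except FileNotFoundError as exc:
            errors.append(type(exc).__name__)
    return errors


print(demo_failures())
```

In this toy model the first failure is a write-write collision on the version id, and the second is a reader holding a path that another process has since deleted.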

          People

            Assignee: Erik Weathers (erikdw)
            Reporter: Ernestas Vaiciukevičius (ernisv)