Details
Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Description
TFRecordIO has an issue reading files from HDFS when using a filename pattern such as "hdfs://...*":
TFRecordIO.read().from(filenamePattern).withCompression(AUTO)
Link to GitHub. This is a blocker for running the full set of file-based IO tests on HDFS.
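For reference, here is a minimal sketch of the failing read side (the host name and path are illustrative, and withCompression(Compression.AUTO) assumes a Beam version that exposes the Compression enum; on older versions the equivalent is withCompressionType(TFRecordIO.CompressionType.AUTO)):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class TFRecordHdfsReadRepro {
  public static void main(String[] args) {
    // Pipeline options are expected to carry the HDFS configuration
    // (e.g. --hdfsConfiguration) so that hdfs:// paths can be resolved.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Illustrative pattern; this glob over hdfs:// is the case that fails.
    String filenamePattern = "hdfs://hadoop-xxxxx:9000/TFRecord*";

    PCollection<byte[]> records =
        p.apply("ReadTFRecords",
            TFRecordIO.read().from(filenamePattern).withCompression(Compression.AUTO));

    p.run().waitUntilFinish();
  }
}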
Steps to reproduce:
1. Create a remote Hadoop environment. This step assumes you have a local kubectl tool configured to use your GCP project.
pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash ./setup-all.sh && popd
2. Update the /etc/hosts file with the output provided by the script.
3. Confirm that it works and that the Hadoop web interface is accessible at http://hadoop-xxxxx:50070, where xxxxx is the sequence added to your /etc/hosts entry in step 2. Please also substitute xxxxx in the commands below.
4. Tell the runner to use root as the Hadoop user:
export HADOOP_USER_NAME=root
5. Run TFRecord tests on this environment using DirectRunner:
mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests/ -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT -Dfilesystem=hdfs -DintegrationTestPipelineOptions='["--filenamePrefix=hdfs://hadoop-xxxxx:9000/TFRecord", "--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://hadoop-xxxxx:9000\", \"dfs.replication\": 1, \"dfs.client.use.datanode.hostname\":\"true\"}]" ]' -DforceDirectRunner=true
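As an aside, the --hdfsConfiguration JSON above corresponds roughly to the following programmatic setup (a hedged sketch that assumes the beam-sdks-java-io-hadoop-file-system module and its HadoopFileSystemOptions interface are available on the classpath):

import java.util.Collections;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

// Hadoop configuration equivalent to the JSON passed via --hdfsConfiguration.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://hadoop-xxxxx:9000");
conf.set("dfs.replication", "1");
conf.set("dfs.client.use.datanode.hostname", "true");

HadoopFileSystemOptions options =
    PipelineOptionsFactory.create().as(HadoopFileSystemOptions.class);
options.setHdfsConfiguration(Collections.singletonList(conf));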
The error message is:
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 78.055 s <<< FAILURE! - in org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
[ERROR] writeThenReadAll(org.apache.beam.sdk.io.tfrecord.TFRecordIOIT) Time elapsed: 78.055 s <<< ERROR!
java.lang.IllegalStateException: Invalid data
    at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
    at org.apache.beam.sdk.io.TFRecordIO$TFRecordCodec.read(TFRecordIO.java:642)
    at org.apache.beam.sdk.io.TFRecordIO$TFRecordSource$TFRecordReader.readNextRecord(TFRecordIO.java:526)
    at org.apache.beam.sdk.io.CompressedSource$CompressedReader.readNextRecord(CompressedSource.java:426)
    at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:473)
    at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:267)
    at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$BoundedReadEvaluator.processElement(BoundedReadEvaluatorFactory.java:148)
    at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:161)
    at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:125)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
These results were also observed when running the tests on Jenkins. Link to Jenkins build.
When you open http://hadoop-xxxxx:50070/explorer.html#/ you will see the TFRecord files that were created during the write phase; they cannot be processed in the read phase.
Important note: if I copy the files produced by the write pipeline from the HDFS directory to a local directory and run the read pipeline over them, everything works fine, so only reading from HDFS is the problem.
You can wipe out the HDFS environment by running:
pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash ./teardown-all.sh && popd