Details
Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Description
TFRecordIO has an issue reading files from HDFS when using a filename pattern such as "hdfs://...*":
TFRecordIO.read().from(filenamePattern).withCompression(AUTO)
Link to GitHub. This is a blocker for running the full set of file-based IO tests on HDFS.
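For reference, here is a minimal sketch of the failing read side (the host name and path are illustrative, and withCompression(Compression.AUTO) assumes a Beam version that exposes the Compression enum; on older versions the equivalent is withCompressionType(TFRecordIO.CompressionType.AUTO)):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TFRecordIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class TFRecordHdfsReadRepro {
  public static void main(String[] args) {
    // Pipeline options are expected to carry the HDFS configuration
    // (e.g. --hdfsConfiguration) so that hdfs:// paths can be resolved.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Illustrative pattern; this glob over hdfs:// is the case that fails.
    String filenamePattern = "hdfs://hadoop-xxxxx:9000/TFRecord*";

    PCollection<byte[]> records =
        p.apply("ReadTFRecords",
            TFRecordIO.read().from(filenamePattern).withCompression(Compression.AUTO));

    p.run().waitUntilFinish();
  }
}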
Steps to reproduce:
1. Create a remote Hadoop environment. This step assumes you have a local kubectl tool configured to use your GCP project.
pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash ./setup-all.sh && popd
2. Update the /etc/hosts file with the output provided by the script.
3. Confirm that it works and that the Hadoop web interface is accessible at http://hadoop-xxxxx:50070, where xxxxx is the sequence added to your /etc/hosts entry in step 2. Please also substitute xxxxx in the commands below.
4. Tell the runner to use root as the Hadoop user:
export HADOOP_USER_NAME=root
5. Run TFRecord tests on this environment using DirectRunner:
mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests/ -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT -Dfilesystem=hdfs -DintegrationTestPipelineOptions='["--filenamePrefix=hdfs://hadoop-xxxxx:9000/TFRecord", "--hdfsConfiguration=[{\"fs.defaultFS\" : \"hdfs://hadoop-xxxxx:9000\", \"dfs.replication\": 1, \"dfs.client.use.datanode.hostname\":\"true\"}]" ]' -DforceDirectRunner=true
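As an aside, the --hdfsConfiguration JSON above corresponds roughly to the following programmatic setup (a hedged sketch that assumes the beam-sdks-java-io-hadoop-file-system module and its HadoopFileSystemOptions interface are available on the classpath):

import java.util.Collections;
import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;

// Hadoop configuration equivalent to the JSON passed via --hdfsConfiguration.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://hadoop-xxxxx:9000");
conf.set("dfs.replication", "1");
conf.set("dfs.client.use.datanode.hostname", "true");

HadoopFileSystemOptions options =
    PipelineOptionsFactory.create().as(HadoopFileSystemOptions.class);
options.setHdfsConfiguration(Collections.singletonList(conf));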
The error message is:
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 78.055 s <<< FAILURE! - in org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
[ERROR] writeThenReadAll(org.apache.beam.sdk.io.tfrecord.TFRecordIOIT) Time elapsed: 78.055 s <<< ERROR!
java.lang.IllegalStateException: Invalid data
    at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
    at org.apache.beam.sdk.io.TFRecordIO$TFRecordCodec.read(TFRecordIO.java:642)
    at org.apache.beam.sdk.io.TFRecordIO$TFRecordSource$TFRecordReader.readNextRecord(TFRecordIO.java:526)
    at org.apache.beam.sdk.io.CompressedSource$CompressedReader.readNextRecord(CompressedSource.java:426)
    at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:473)
    at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:267)
    at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$BoundedReadEvaluator.processElement(BoundedReadEvaluatorFactory.java:148)
    at org.apache.beam.runners.direct.DirectTransformExecutor.processElements(DirectTransformExecutor.java:161)
    at org.apache.beam.runners.direct.DirectTransformExecutor.run(DirectTransformExecutor.java:125)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
These results were also observed when running the tests on Jenkins. Link to Jenkins build.
When you open http://hadoop-xxxxx:50070/explorer.html#/ you will see the TFRecord files that were created during the write phase; they cannot be processed in the read phase.
Important note: if I copy the files produced by the write pipeline from the HDFS directory to a local directory and run the read pipeline over them, everything works fine, so only reading from HDFS is the problem.
You can wipe out the HDFS environment by running:
pushd .test-infra/kubernetes/hadoop/SmallITCluster/ && /bin/bash ./teardown-all.sh && popd