Details
- Type: Bug
- Status: Open
- Priority: P3
- Resolution: Unresolved
- Affects Version: 2.15.0
- Fix Version: None
- Labels: None
- Component: dataflow
Description
I am running an Apache Beam job to parse Japanese HTML pages. While the job is running, the Stackdriver log shows the Japanese characters correctly, but the same data written to the GCS bucket has an encoding issue and comes out corrupted.
Pipeline pipeline = Pipeline.create(options);
CoderRegistry cr = pipeline.getCoderRegistry();
cr.registerCoderForClass(String.class, StringUtf8Coder.of());
cr.registerCoderForClass(Integer.class, BigEndianIntegerCoder.of());
batchTuple = pipeline
    .apply("Read from input files",
        TextIO.read().from(options.getloadingBucketURL())
            .withCompression(Compression.GZIP))
    .setCoder(StringUtf8Coder.of())
    .apply("Process input files",
        ParDo.of(new ExtractDataFromHtmlPage(extractionConfig, beamConfig.getLoadingBucketURL()))
            .withOutputTags(successRecord, TupleTagList.of(errorRecord).and(deadLetterRecords)));
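The symptom described above (text looks correct in the logs but is garbled in the written file) is typical of a charset mismatch: the UTF-8 bytes are intact, but somewhere along the write or read path they are interpreted with a different encoding. The following self-contained sketch (plain JDK, no Beam dependency; the example string and charset choice are illustrative, not taken from the job) reproduces that class of corruption by decoding UTF-8 bytes with the wrong charset:

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    public static void main(String[] args) {
        String text = "日本語";  // hypothetical sample of Japanese page content

        // Encoding with UTF-8 yields 3 bytes per CJK character here.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 byte count: " + utf8Bytes.length);

        // Decoding those same bytes with a wrong charset (e.g. ISO-8859-1,
        // a common default on misconfigured readers) produces mojibake —
        // the same kind of corruption reported for the GCS output.
        String corrupted = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println("Decoded with ISO-8859-1: " + corrupted);

        // The underlying bytes are unchanged, so re-encoding with the
        // wrong charset and decoding with UTF-8 recovers the original:
        String recovered = new String(
            corrupted.getBytes(StandardCharsets.ISO_8859_1),
            StandardCharsets.UTF_8);
        System.out.println("Round-trip matches original: " + text.equals(recovered));
    }
}
```

If the bytes in the GCS object similarly round-trip back to valid Japanese, the data was written correctly and only the tool used to view or download it is applying the wrong charset; if not, the corruption happened before or during the write.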