[BEAM-8423] Japanese characters encoding issue - ASF JIRA

Details

Type: Bug
Status: Open
Priority: P3
Resolution: Unresolved
Affects Version/s: 2.15.0
Fix Version/s: None
Component/s: runner-dataflow
Labels:
None
Environment:
dataflow

Description

I am running an Apache Beam job to parse Japanese HTML pages. While the job runs, the Stackdriver logs show the Japanese characters correctly, but the same data written to the GCS bucket is corrupted by an encoding issue.

// Pipeline setup: register coders explicitly, then read and process the input files.
Pipeline pipeline = Pipeline.create(options);
CoderRegistry cr = pipeline.getCoderRegistry();
cr.registerCoderForClass(String.class, StringUtf8Coder.of());
cr.registerCoderForClass(Integer.class, BigEndianIntegerCoder.of());

batchTuple = pipeline
    .apply("Read from input files",
        TextIO.read()
            .from(options.getloadingBucketURL())
            .withCompression(Compression.GZIP))
    .setCoder(StringUtf8Coder.of())
    .apply("Process input files",
        ParDo.of(new ExtractDataFromHtmlPage(extractionConfig, beamConfig.getLoadingBucketURL()))
            .withOutputTags(successRecord, TupleTagList.of(errorRecord).and(deadLetterRecords)));
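
The cause is not confirmed in this report. One possibility worth ruling out is a charset mismatch somewhere between the bytes that are written and the way they are later decoded, for example a `getBytes()` or `new String(bytes)` call without an explicit charset on a worker whose default charset is not UTF-8. The standalone sketch below only illustrates that effect with plain JDK calls. The class `EncodingCheck` and the method `roundTrip` are hypothetical names and are not part of the pipeline above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    // Hypothetical helper: encode text with one charset and decode it with
    // another, mimicking a writer and a reader that disagree on the encoding.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        byte[] bytes = text.getBytes(writeCharset);
        return new String(bytes, readCharset);
    }

    public static void main(String[] args) {
        // "Nihongo" (Japanese) written with Unicode escapes so the result does
        // not depend on the encoding of this source file.
        String sample = "\u65E5\u672C\u8A9E";

        // The default charset is what charset-less String/byte conversions use.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Matching charsets: the text survives unchanged.
        String ok = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8 intact: " + ok.equals(sample));

        // Reader assumes a different charset: each UTF-8 byte becomes one char.
        String mojibake = roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 -> ISO-8859-1 length: " + mojibake.length());

        // Writer uses a charset that cannot represent the text: data is lost.
        String lost = roundTrip(sample, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println("US-ASCII -> UTF-8: " + lost);
    }
}
```

If the default charset printed on a Dataflow worker turns out not to be UTF-8, any charset-less conversion inside `ExtractDataFromHtmlPage` would be a candidate to check. This is an assumption to verify, not a diagnosis.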

People

Assignee: Unassigned
Reporter: Jyoti Aditya
Votes: 1
Watchers: 4

Dates

Created: 17/Oct/19 12:53
Updated: 04/Jun/22 14:27