[HIVE-5590] select returns duplicated records in Hive when a .deflate file larger than 64MB is loaded into a Hive table - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- 64M
- count(*)
- duplited
- hdfs
- hive
- records
Environment:

cdh4

Tags:
64M hive hdfs count(*) duplited records

Description

We occasionally produce compressed files in .deflate format that are larger than 160MB. One such file was loaded into Hive through an external table, say table T_A.
Running select count(*) from T_A returned about 70% more records than checking the same file with "hadoop fs -text /xxxxx | wc -l".
Any clue about this? How could it happen?
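
For anyone trying to reproduce the comparison without a cluster, here is a minimal sketch of the "wc -l" side of the check in Python. It assumes the .deflate file has been copied to local disk and that it is a single zlib stream (which is what I understand Hadoop's default codec writes); the function name `count_lines_in_deflate` is made up for illustration.

```python
import zlib


def count_lines_in_deflate(path, chunk_size=1 << 20):
    """Count newline characters in a zlib-compressed (.deflate) file.

    Assumption: the file holds exactly one zlib stream. Like `wc -l`,
    this counts b"\n" bytes, so a final line without a trailing newline
    is not counted.
    """
    decompressor = zlib.decompressobj()
    newlines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Decompress incrementally so a large file never sits fully in memory.
            newlines += decompressor.decompress(chunk).count(b"\n")
    # Pick up any data still buffered inside the decompressor.
    newlines += decompressor.flush().count(b"\n")
    return newlines
```

The number returned here can then be compared with the result of select count(*) on the table backed by the same file.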

The large .deflate file was the result of imperfect processing upstream. After we fixed that and produced files smaller than 64MB, the problem did not come up again. But since there is no guarantee that a larger file will not show up again, is there any way to avoid this problem?
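
Until the cause is understood, one stopgap is to catch oversized files before they reach the table location. The sketch below only flags them; it does not address why Hive returns the extra records. It assumes the files sit in a local staging directory before being uploaded, and the 64MB limit is simply the size below which we did not see the problem. The names `SIZE_LIMIT_BYTES` and `oversized_deflate_files` are hypothetical.

```python
from pathlib import Path

# Assumption: 64MB, the size below which the problem was not observed.
SIZE_LIMIT_BYTES = 64 * 1024 * 1024


def oversized_deflate_files(directory, limit=SIZE_LIMIT_BYTES):
    """Return sorted paths of .deflate files under `directory` with size >= `limit` bytes."""
    return sorted(
        p
        for p in Path(directory).rglob("*.deflate")
        if p.is_file() and p.stat().st_size >= limit
    )
```

Any file this returns would need to be regenerated in smaller pieces before loading.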

cheers!
eye

People

Assignee: Unassigned

Reporter: eye

Votes: 0

Watchers: 1

Dates

Created: 18/Oct/13 10:11

Updated: 18/Oct/13 10:14

Time Tracking

Estimated: 48h

Remaining: 48h

Logged: Not Specified