Friday, 15 December 2017

Data cleaning using Python


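Hi all,
Data cleaning is a vital first step before analysing text data such as tweets. The code below walks through the usual steps: unescaping HTML entities, removing non-ASCII characters, expanding apostrophes, splitting attached words, replacing slang and standardizing elongated words.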



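import html
import itertools
import re

# Actual text (original_tweet is assumed to already hold this string):
# "I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com"

# Step 1: unescape HTML entities (Python 3 replacement for HTMLParser.unescape)
tweet = html.unescape(original_tweet)

# Step 2: drop every character that cannot be represented in ASCII
tweet = tweet.encode("ascii", "ignore").decode("ascii")

# Step 3: expand apostrophe forms
APPOSTOPHES = {"'s": " is", "'re": " are"}  # ... needs a huge dictionary
words = tweet.split()
reformed = [APPOSTOPHES[word] if word in APPOSTOPHES else word for word in words]
reformed = " ".join(reformed)

# Step 4: split attached words such as DisplayIsAwesome at capital letters
cleaned = " ".join(re.findall("[A-Z][^A-Z]*", original_tweet))

# Step 5: replace slang (_slang_loopup is a helper that is not defined in this post)
tweet = _slang_loopup(tweet)

# Step 6: keep at most two of any repeated character (happppppy -> happy)
tweet = "".join("".join(s)[:2] for _, s in itertools.groupby(tweet))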
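
The snippet above leaves two pieces open: the apostrophe lookup only matches a token that is exactly "'s" or "'re", which never happens after split(), and the slang helper is not defined. Below is a minimal sketch of how both could look. The names expand_apostrophes, slang_lookup, APOSTROPHE_SUFFIXES and SLANG are hypothetical, the two small tables are illustrative assumptions, and real tables would need to be far larger.

```python
# Illustrative lookup tables; both are assumptions and far from complete.
APOSTROPHE_SUFFIXES = {"'s": " is", "'re": " are"}
SLANG = {"luv": "love", "awsm": "awesome"}


def expand_apostrophes(text):
    """Expand a known apostrophe suffix at the end of each token."""
    # Treat the curly apostrophe like the straight one before matching.
    text = text.replace("\u2019", "'")
    expanded = []
    for word in text.split():
        for suffix, replacement in APOSTROPHE_SUFFIXES.items():
            if word.endswith(suffix) and len(word) > len(suffix):
                word = word[: -len(suffix)] + replacement
                break
        expanded.append(word)
    return " ".join(expanded)


def slang_lookup(text, slang=SLANG):
    """Replace whole tokens found in the slang table (case-insensitive)."""
    return " ".join(slang.get(word.lower(), word) for word in text.split())


sample = "I luv my iphone, you\u2019re awsm"
print(slang_lookup(expand_apostrophes(sample)))
```

Tokens that carry trailing punctuation (for example "awsm.") are not matched by this sketch, so punctuation would have to be stripped or tokenized separately first.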


https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/







Advanced data cleaning:

Grammar checking:
Spelling correction:
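
For spelling correction, one simple approach is to replace each unknown word with its closest match in a known vocabulary. The sketch below uses difflib.get_close_matches from the Python standard library. The vocabulary, the cutoff of 0.8 and the function names are assumptions chosen for illustration; a real vocabulary would come from a dictionary file or a corpus.

```python
import difflib

# Hypothetical vocabulary, for illustration only.
VOCABULARY = ["apple", "awesome", "happy", "iphone", "love"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word unchanged if none is close enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply correct_word to every whitespace-separated token."""
    return " ".join(correct_word(word) for word in text.split())


print(correct_text("so hapy with my iphon"))
```

A high cutoff keeps ordinary words that are missing from the vocabulary from being rewritten, at the cost of leaving heavily misspelled words untouched.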
