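Hi all,
Data cleaning is a vital first step before analysing text such as tweets. The code below walks through the basic steps in Python: decoding HTML entities, dropping non-ASCII characters, expanding contractions, splitting attached words, replacing slang and trimming repeated characters.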
import html
# Decode HTML entities such as &amp; and &lt; back into plain characters (Python 3)
tweet = html.unescape(original_tweet)
# Actual text
#“I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com”
# Drop every non-ASCII character (assumes original_tweet is a Python 3 str)
tweet = original_tweet.encode("ascii", "ignore").decode("ascii")
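APOSTROPHES = {"'s": " is", "'re": " are"}  # ... needs a much larger dictionary
def expand(word):
    # Replace a trailing contraction suffix, e.g. "you're" -> "you are"
    for suffix, full in APOSTROPHES.items():
        if word.endswith(suffix):
            return word[:-len(suffix)] + full
    return word
reformed = " ".join(expand(word) for word in tweet.split())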
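import re
# Split attached words at capital letters, e.g. "DisplayIsAwesome" -> "Display Is Awesome"
cleaned = " ".join(re.findall(r"[A-Z][^A-Z]*", original_tweet))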
# _slang_lookup is a user-defined helper that replaces slang with standard words
tweet = _slang_lookup(tweet)
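import itertools
# Keep at most two consecutive copies of each character, e.g. "happppppy" -> "happy"
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))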
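To see how the steps above might fit together, here is a minimal sketch that chains them into one function. Several things in it are assumptions rather than part of the original snippets: the names clean_tweet, expand_contractions, slang_lookup, squeeze_repeats and split_camel_case are made up for illustration, the two lookup tables are tiny hypothetical stand-ins for the large dictionaries a real project would need, and the curly-apostrophe replacement is an extra step added so that "you’re" is not mangled when non-ASCII characters are dropped.

```python
import html
import itertools
import re

# Hypothetical, deliberately tiny lookup tables; real ones would be far larger.
CONTRACTIONS = {"'s": " is", "'re": " are"}
SLANG = {"luv": "love", "awsm": "awesome"}


def expand_contractions(text):
    """Replace a trailing contraction suffix on each word, e.g. "you're"."""
    def expand(word):
        for suffix, full in CONTRACTIONS.items():
            if word.endswith(suffix):
                return word[:-len(suffix)] + full
        return word
    return " ".join(expand(w) for w in text.split())


def slang_lookup(text):
    """Swap whole-word slang for its standard form (hypothetical helper)."""
    return " ".join(SLANG.get(w.lower(), w) for w in text.split())


def split_camel_case(word):
    """Split an attached word such as "DisplayIsAwesome" at capital letters."""
    # Only split words that start with a capital, so no leading text is lost.
    if not word[:1].isupper():
        return word
    return " ".join(re.findall(r"[A-Z][^A-Z]*", word))


def squeeze_repeats(text):
    """Keep at most two consecutive copies of any character."""
    return "".join("".join(group)[:2] for _, group in itertools.groupby(text))


def clean_tweet(raw):
    text = html.unescape(raw)                   # decode HTML entities
    text = text.replace("\u2019", "'")          # assumption: normalise curly apostrophes
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
    text = expand_contractions(text)
    text = slang_lookup(text)
    text = " ".join(split_camel_case(w) for w in text.split())
    return squeeze_repeats(text)


raw_tweet = ("I luv my &lt;3 iphone &amp; you\u2019re awsm apple. "
             "DisplayIsAwesome, sooo happppppy \U0001F642 http://www.apple.com")
print(clean_tweet(raw_tweet))
```

One thing this makes visible: trimming repeated characters also turns "www" into "ww", so URLs are best removed or protected before that step. The capital-letter split is similarly blunt and would break an acronym into single letters.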
https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/
Advanced data cleaning:
Grammar checking:
Spelling correction: