10 Python String Processing Tips & Tricks



Natural language processing and text analytics are hot areas of research and application at the moment. These fields entail a wide variety of specific skills and concepts that require thorough understanding before moving into meaningful practice. Before getting to that point, however, basic string manipulation and processing is a must.

There are 2 distinct types of broad computational string processing skills that need to be broached, in my view. The first of these is regular expressions, a pattern-based approach to text matching. There are numerous great introductions to regular expressions available, but visual learners may appreciate the fast.ai Code-First Intro to Natural Language Processing course video on the topic.
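
As a small taste of the pattern-based approach, the following sketch uses Python's standard re module to pull every run of digits out of a sentence. The sample sentence and the pattern are illustrative assumptions, not material from the course mentioned above.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = 'Order 66 shipped on 2019-11-04 to 3 customers.'

# \d+ matches one or more consecutive digits; findall() returns
# every non-overlapping match as a list of strings.
numbers = re.findall(r'\d+', text)
print(numbers)
```

The pattern describes what a match looks like rather than where it sits in the string, which is what distinguishes regular expressions from the index-based and method-based techniques covered in the rest of this article.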

The other distinct computational string processing skill is being able to leverage a given programming language’s standard library for basic string manipulation. As such, this article is a short Python string processing primer for those toying with the idea of pursuing a more in-depth text analytics career.

Want to unlock value and understanding in all of that text your company has? You had better make sure you understand the most basic of basics first. Check out these beginner tricks for insight.

Note that meaningful text analytics goes way beyond string processing, and the core of these more advanced techniques may not require you to manipulate text on your own very often. However, text data pre-processing is an important and time-consuming part of a successful text analytics project, and the string processing skills mentioned above will be invaluable here. Understanding the computational processing of text at a basic level is also conceptually vital to understanding more advanced text analytics techniques.

Many of the following examples make use of Python’s built-in string (str) methods, so having the Python standard library documentation handy for reference is a good idea.


1. Stripping Whitespace

Stripping whitespace is an elementary string processing requirement. You can strip leading whitespace with the lstrip() method (left), trailing whitespace with rstrip() (right), and both leading and trailing with strip().

s = '   This is a sentence with whitespace.       \n'

print('Strip leading whitespace: {}'.format(s.lstrip()))
print('Strip trailing whitespace: {}'.format(s.rstrip()))
print('Strip all whitespace: {}'.format(s.strip()))
Strip leading whitespace: This is a sentence with whitespace.       

Strip trailing whitespace:    This is a sentence with whitespace.
Strip all whitespace: This is a sentence with whitespace.


Interested in stripping characters other than whitespace? The same methods are helpful, and are used by passing in the character(s) you want stripped.

s = 'This is a sentence with unwanted characters.AAAAAAAA'

print('Strip unwanted characters: {}'.format(s.rstrip('A')))
Strip unwanted characters: This is a sentence with unwanted characters.
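
One detail worth knowing when passing characters to the strip methods: the argument is treated as a set of characters, not as a literal prefix or suffix. The following is a minimal sketch with a made-up example string.

```python
# Hypothetical example string, chosen only for illustration.
s = 'xxyHello, world!yxx'

# The argument 'xy' means "any of the characters x or y".
# Leading and trailing x's and y's are removed in whatever order they
# appear, and stripping stops at the first character not in the set.
stripped = s.strip('xy')
print('Strip a set of characters: {}'.format(stripped))
```

Characters in the middle of the string are never touched, even if they belong to the set.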


Be sure to check out the string format() documentation if necessary.


2. Splitting Strings

Splitting strings into lists of smaller substrings is often useful and easily accomplished in Python with the split() method.

s = 'KDnuggets is a fantastic resource'

print(s.split())
['KDnuggets', 'is', 'a', 'fantastic', 'resource']


By default, split() splits on whitespace, but other character sequences can be passed in as well.

s = 'these,words,are,separated,by,comma'
print("',' separated split -> {}".format(s.split(',')))

s = 'abacbdebfgbhhgbabddba'
print("'b' separated split -> {}".format(s.split('b')))
',' separated split -> ['these', 'words', 'are', 'separated', 'by', 'comma']
'b' separated split -> ['a', 'ac', 'de', 'fg', 'hhg', 'a', 'dd', 'a']
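
Two behaviors of split() tend to surprise newcomers: the default whitespace mode collapses runs of whitespace, while an explicit separator does not, and an optional maxsplit argument caps the number of splits. The sketch below illustrates both with made-up sample strings.

```python
# Hypothetical sample string; note the two spaces after 'one'.
s = 'one  two three'

# With no argument, any run of whitespace counts as a single separator.
default_split = s.split()

# With an explicit separator, consecutive separators yield empty strings.
explicit_split = s.split(' ')

# maxsplit limits how many splits are made, counting from the left;
# the remainder of the string is returned as the final element.
limited_split = 'a,b,c,d'.split(',', 1)

print(default_split)
print(explicit_split)
print(limited_split)
```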



3. Joining List Elements Into a String

Need the opposite of the above operation? You can join list element strings into a single string in Python using the join() method.

s = ['KDnuggets', 'is', 'a', 'fantastic', 'resource']

print(' '.join(s))
KDnuggets is a fantastic resource


Ain’t that the truth! And what if you want to join list elements with something other than whitespace in between? This may look a little bit stranger, but it is also easily accomplished.

s = ['Eleven', 'Mike', 'Dustin', 'Lucas', 'Will']

print(' and '.join(s))
Eleven and Mike and Dustin and Lucas and Will
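
Note that join() only accepts strings. If the list holds other types, such as numbers, each element needs converting first. A minimal sketch, using a made-up list of integers:

```python
# Hypothetical list of non-string values.
values = [3, 1, 4, 1, 5]

# str.join() raises TypeError if any element is not a string,
# so convert each element with str() before joining.
joined = ', '.join(str(v) for v in values)
print(joined)
```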



4. Reversing a String

Python does not have a built-in string reverse method. However, given that strings can be sliced like lists, reversing one can be done in the same succinct fashion that a list’s elements can be reversed.

s = 'KDnuggets'

print('The reverse of KDnuggets is: {}'.format(s[::-1]))
The reverse of KDnuggets is: steggunDK
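
If the slice notation feels too cryptic, an alternative sketch combines the built-in reversed() function with join(). This is simply another way to get the same result, not a replacement for the slicing approach.

```python
s = 'KDnuggets'

# reversed() yields the characters from last to first;
# join() reassembles them into a new string.
reversed_s = ''.join(reversed(s))
print(reversed_s)
```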



5. Converting Uppercase and Lowercase

Converting between cases can be done with the upper(), lower(), and swapcase() methods.

s = 'KDnuggets'

print("'KDnuggets' as uppercase: {}".format(s.upper()))
print("'KDnuggets' as lowercase: {}".format(s.lower()))
print("'KDnuggets' as swapped case: {}".format(s.swapcase()))
'KDnuggets' as uppercase: KDNUGGETS
'KDnuggets' as lowercase: kdnuggets
'KDnuggets' as swapped case: kdNUGGETS
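
A common practical use of case conversion is comparing strings while ignoring case. The sketch below uses casefold(), a str method similar to lower() but more aggressive for some non-ASCII letters. The sample strings are assumptions chosen for illustration.

```python
a = 'KDnuggets'
b = 'kdNUGGETS'

# Normalize both sides before comparing so that case differences vanish.
same_ignoring_case = a.casefold() == b.casefold()
print(same_ignoring_case)

# casefold() goes further than lower() for some characters:
# the German sharp s is expanded to 'ss'.
folded = 'Straße'.casefold()
lowered = 'Straße'.lower()
print(folded, lowered)
```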



6. Checking for String Membership

The simplest way to check for string membership in Python is using the in operator. The syntax is very natural language-like.

s1 = 'perpendicular'
s2 = 'pen'
s3 = 'pep'

print("'pen' in 'perpendicular' -> {}".format(s2 in s1))
print("'pep' in 'perpendicular' -> {}".format(s3 in s1))
'pen' in 'perpendicular' -> True
'pep' in 'perpendicular' -> False


If you are more interested in finding the location of a substring within a string (as opposed to simply checking whether or not the substring is contained), the find() string method can be more helpful.

s = 'Does this string contain a substring?'

print("'string' location -> {}".format(s.find('string')))
print("'spring' location -> {}".format(s.find('spring')))
'string' location -> 10
'spring' location -> -1


find() returns the index of the first character of the first occurrence of the substring by default, and returns -1 if the substring is not found. Check the documentation for available tweaks to this default behavior.
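
A few of those tweaks can be sketched as follows: an optional start argument lets the search resume after a previous match, rfind() searches from the right, and index() behaves like find() but raises ValueError instead of returning -1. The variable names here are illustrative.

```python
s = 'Does this string contain a substring?'

# First occurrence, as in the example above.
first = s.find('string')

# Passing a start position resumes the search after the first match,
# which finds the 'string' inside 'substring'.
second = s.find('string', first + 1)

# rfind() returns the start index of the last occurrence.
last = s.rfind('string')

# index() raises ValueError when the substring is absent.
try:
    s.index('spring')
    missing = False
except ValueError:
    missing = True

print(first, second, last, missing)
```

Whether -1 or an exception is preferable depends on whether a missing substring is an expected case or a genuine error in your program.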


7. Replacing Substrings

What if you want to replace substrings, instead of just find them? The Python replace() string method will take care of that.

s1 = 'The theory of data science is of the utmost importance.'
s2 = 'practice'

print('The new sentence: {}'.format(s1.replace('theory', s2)))
The new sentence: The practice of data science is of the utmost importance.


An optional count argument can specify the maximum number of successive replacements to make if the same substring occurs multiple times.
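
As a minimal sketch of the count argument, using a made-up sentence in which the target substring occurs three times:

```python
# Hypothetical sample sentence with a repeated substring.
s = 'of the people, by the people, for the people'

# With count=1, only the first (leftmost) occurrence is replaced.
first_only = s.replace('people', 'crowd', 1)

# Without count, every occurrence is replaced.
all_of_them = s.replace('people', 'crowd')

print(first_only)
print(all_of_them)
```

Strings are immutable, so replace() always returns a new string and leaves the original unchanged.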


8. Combining the Output of Multiple Lists

Have multiple lists of strings you want to combine together in some element-wise fashion? No problem with the zip() function.

countries = ['USA', 'Canada', 'UK', 'Australia']
cities = ['Washington', 'Ottawa', 'London', 'Canberra']

for x, y in zip(countries, cities):
  print('The capital of {} is {}.'.format(x, y))
The capital of USA is Washington.
The capital of Canada is Ottawa.
The capital of UK is London.
The capital of Australia is Canberra.
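
One caveat: zip() stops at the end of the shortest input, silently dropping any leftover items. If the lists might differ in length, itertools.zip_longest from the standard library pads the shorter one instead. The sketch below uses deliberately mismatched lists, and the 'unknown' fill value is an arbitrary choice.

```python
from itertools import zip_longest

countries = ['USA', 'Canada', 'UK']
cities = ['Washington', 'Ottawa']   # deliberately one item short

# zip() truncates to the shortest list, so 'UK' is dropped.
pairs = list(zip(countries, cities))

# zip_longest() keeps going and fills the gaps with fillvalue.
padded = list(zip_longest(countries, cities, fillvalue='unknown'))

print(pairs)
print(padded)
```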



9. Checking for Anagrams

Want to check if a pair of strings are anagrams of one another? Algorithmically, all we need to do is count the occurrences of each letter for each string and check if these counts are equal. This is straightforward using the Counter class of the collections module.

from collections import Counter
def is_anagram(s1, s2):
  return Counter(s1) == Counter(s2)

s1 = 'listen'
s2 = 'silent'
s3 = 'runner'
s4 = 'neuron'

print("'listen' is an anagram of 'silent' -> {}".format(is_anagram(s1, s2)))
print("'runner' is an anagram of 'neuron' -> {}".format(is_anagram(s3, s4)))
'listen' is an anagram of 'silent' -> True
'runner' is an anagram of 'neuron' -> False



10. Checking for Palindromes

How about if you want to check whether a given word is a palindrome? Algorithmically, we need to create a reverse of the word and then use the == operator to check if these 2 strings (the original and the reverse) are equal.

def is_palindrome(s):
  # Build the reversed string, then compare it with the original.
  reverse = s[::-1]
  return s == reverse

s1 = 'racecar'
s2 = 'hippopotamus'

print("'racecar' is a palindrome -> {}".format(is_palindrome(s1)))
print("'hippopotamus' is a palindrome -> {}".format(is_palindrome(s2)))
'racecar' is a palindrome -> True
'hippopotamus' is a palindrome -> False
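
The same reverse-and-compare idea can be stretched to whole phrases by first discarding punctuation, spaces, and case. The following is a sketch under that assumption, and the function name is_palindrome_phrase is hypothetical.

```python
def is_palindrome_phrase(s):
    """Hypothetical helper: palindrome check that ignores case and
    any character that is not a letter or digit."""
    # Keep only alphanumeric characters, lowercased.
    cleaned = ''.join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]

phrase_result = is_palindrome_phrase('A man, a plan, a canal: Panama')
word_result = is_palindrome_phrase('hippopotamus')
print(phrase_result, word_result)
```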


These string processing “tricks” will not make you a text analytics or natural language processing expert on their own, but they may spark an interest in pursuing these fields and learning the techniques that are necessary for eventually becoming just such an expert.

