pydata

Keep Looking, Don't Settle

2019-06-08 Week 23 Regular Expression to clean data

regular expression example

We have many sql alike code to patch the system. The purpose is to clean the code to do two things: 1) to find what variables are used; 2) to find the value used in the code. That is, transfer to key-value form so that it can be easily analyzed in python dataFrame.

Some examples are:

The input string is s = 'segment == 5 && age > 30', the output will be (segment, == 5) and (age, > 30)

The input string is s = 'imatches("^(US|)\(", ctry_cd) && (age >= 10)', the output will be (ctry_cd, ^(US|)\)) and (age, >= 10)

Let's start from an example:

cond = ['$seg == 5 ', ' imatches("^228$",$gls) ', ' $ipage > 30 ', ' $fpage < 2 ', ' $dayob > 200 ', 
        ' $ccage > 30 ', ' $val > 99 ', ' $day7GLRepeatCount > 0 ', ' $opvCustomerGCOrderCount90Days == 0 ', 
        ' imatches("US",$ictry_cd) ', ' imatches("US",$bctry_cd) ', ' imatches("^(US|)$",$cctry_cd) ', ' $sqPercentile > 0.7']

def replaceMultiple(s, rep = {'(':'', ')':''}):
    '''
    re multile pattern in s
    '''
    rep = dict((re.escape(k), v) for k, v in rep.items())
    pattern = re.compile("|".join(rep.keys()))
    #print(pattern)
    return pattern.sub(lambda m: rep[re.escape(m.group(0))], s)

def getVarFromCondition2(string):
    '''
    return the variables and its value in the rule
    # rep1 -- lookforward anything followed by pattern ==,<=,>=,<,> but pattern not included in output, 也就是pattern ==,<=,>=,<,> 的左边
    # rep2 -- lookbehind anything before pattern ==,<=,>=,<,> but patten not included in output, 也就是 pattern ==,<=,>=,<,> 的右边
    '''
    res = []
    rep1 = r'.*(?===)|.*(?=<=)|.*(?=>=)|.*(?=<)|.*(?=>)' 
    rep2 = r'(?<===).*|(?<=<=).*|(?<=>=).*|(?<=<).*|(?<=>).*'  
    rep3 = r'==.+ | <.+ | >.* | <=.+ | >=.+'
    #rep1 = r'[^ -][^==]*$'
    for s in string:
        # remove blanks
        s = "".join(s.split())
        print(s)
        if 'match' in s.lower():
            res.append(replaceMultiple(re.findall(r'\(.+\)', s)[0]).split(',')[::-1])
        else:
            res.append((re.findall(rep1, s)[0], re.findall(rep2, s)[0]))
    return res

getVarFromCondition2(cond)


    Out[38]:
    [('$seg', '5'),
     ['$gls', '"^228$"'],
     ('$ipage', '30'),
     ('$fpage', '2'),
     ('$dayob', '200'),
     ('$ccage', '30'),
     ('$val', '99'),
     ('$day7GLRepeatCount', '0'),
     ('$opvCustomerGCOrderCount90Days', '0'),
     ['$ictry_cd', '"US"'],
     ['$bctry_cd', '"US"'],
     ['$cctry_cd', '"^US|$"'],
     ('$sqPercentile', '0.7')]

Reference

  1. How to replace multiple substrings of a string?
  2. Regex get text before and after a hyphen
  3. How to strip all whitespace from string
  4. https://www.regular-expressions.info/lookaround.html