Add section on composing and verbose regular expressions
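
For review purposes, here is a minimal sketch of the technique the new "Composition" section demonstrates: small sub-patterns are combined into one expression compiled with `re.VERBOSE`. The sub-patterns and the sample log line are copied from the added cells. The helper name `parse_log_entry` and the dictionary keys are hypothetical and do not appear in the notebook.

```python
import re

# Sample log line, copied from the notebook cell added in this commit.
log_entry = '2021-08-25 17:04:23.439405 [info]: end process 1 exited with 2'

# Building blocks, as defined in the notebook.
date = r'\d{4}-\d{2}-\d{2}'
time = r'\d{2}:\d{2}:\d{2}\.\d+'
level = r'\[(\w+)\]'
msg = r'end\s+process\s+(\d+)\s+exited\s+with\s+(\d+)'

# With re.VERBOSE, unescaped whitespace in the pattern is ignored and '#'
# starts a comment, so blanks in the input are matched with \s here.
regex = re.compile(r'''
    ({date}\s+{time})\s+   # date-time
    {level}\s*:\s*         # log level
    {msg}                  # process number and exit value
    '''.format(date=date, time=time, level=level, msg=msg), re.VERBOSE)


def parse_log_entry(line):
    """Hypothetical helper: return the captured fields, or None if no match."""
    match = regex.match(line)
    if match is None:
        return None
    datetime_str, log_level, process, exit_status = match.groups()
    return {
        'datetime': datetime_str,
        'level': log_level,
        'process': int(process),
        'exit_status': int(exit_status),
    }


fields = parse_log_entry(log_entry)
no_fields = parse_log_entry('not a log line')
```

The notebook cells call `match.group(...)` directly. The sketch instead returns `None` for a line that does not match, so a malformed line does not raise an `AttributeError`.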

gjbex · gjbex · commit 11160e9ff350 · 2021-08-25T17:39:08.000+02:00
diff --git a/source-code/regexes/regexes.ipynb b/source-code/regexes/regexes.ipynb
@@ -4,14 +4,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Regular expressions in Python"
+    "Regular expressions are very useful in many situations, and not exclusive to Python.  In fact, once you grasp the concepts, you'll find them indispensable and use them (or miss them) for many programming and data management tasks.  This notebook intends to give you a flavor of the possibilities; it doesn't intend to be a comprehensive overview."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Regular expressions are very useful in many situations, and not exclusive to Python.  In fact, once you grasp the concepts, you'll find them indispensible and use them (or miss) them for many programming and data management tasks.  This notebook intends to give you a flavor of the possibilities, it doesn't intend to be a comprehensive overview."
+    "# Requirements"
    ]
   },
   {
@@ -23,9 +23,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 4,
    "metadata": {
-    "collapsed": true
+    "tags": []
    },
    "outputs": [],
    "source": [
@@ -36,7 +36,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Match making"
+    "# Match making"
    ]
   },
   {
@@ -57,7 +57,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "collapsed": false
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
    },
    "outputs": [],
    "source": [
@@ -78,7 +81,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "collapsed": false
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
    },
    "outputs": [],
    "source": [
@@ -99,7 +105,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "collapsed": false
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
    },
    "outputs": [],
    "source": [
@@ -120,7 +129,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "collapsed": false
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
    },
    "outputs": [],
    "source": [
@@ -134,7 +146,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Extracting stuff"
+    "# Extracting stuff"
    ]
   },
   {
@@ -155,7 +167,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "collapsed": false
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
    },
    "outputs": [],
    "source": [
@@ -177,7 +192,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Substitution"
+    "# Substitution"
    ]
   },
   {
@@ -191,7 +206,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "collapsed": false
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
    },
    "outputs": [],
    "source": [
@@ -212,7 +230,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "collapsed": false
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    }
    },
    "outputs": [],
    "source": [
@@ -221,6 +242,183 @@
     "    new_file_name = re.sub(r'(\\w+)_(\\d+)\\.', r'\\2_\\1.', file_name)\n",
     "    print('{old:15s} -> {new}'.format(old=file_name, new=new_file_name))"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Composition"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Sophisticated regular expressions tend to be very hard to read.  There are a couple of things you can do to mitigate that issue.\n",
+    "* Use `re.VERBOSE` so that you can add whitespace and comments to the regular expression definitions.\n",
+    "* Use composition, i.e., define regular expressions that describe part of the match, and compose those to match the entire expression."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Consider the following example, a log message.  We want to extract the date-time information, the log level, the process number and the exit value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "log_entry = '2021-08-25 17:04:23.439405 [info]: end process 1 exited with 2'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Rather than writing a regular expression that describes the entire log message, we write expressions that match part of it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "date = r'\\d{4}-\\d{2}-\\d{2}'\n",
+    "time = r'\\d{2}:\\d{2}:\\d{2}\\.\\d+'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's check that the time matches."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'17:04:23.439405'"
+      ]
+     },
+     "execution_count": 34,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "match = re.search(time, log_entry)\n",
+    "match.group(0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can now use `date` and `time` to match the entire date-time value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'2021-08-25 17:04:23.439405'"
+      ]
+     },
+     "execution_count": 45,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "regex = re.compile(r'({date}\\s+{time})'.format(date=date, time=time))\n",
+    "match = regex.search(log_entry)\n",
+    "match.group(1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "level = r'\\[(\\w+)\\]'\n",
+    "msg = r'end\\s+process\\s+(\\d+)\\s+exited\\s+with\\s+(\\d+)'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "datetime = 2021-08-25 17:04:23.439405\n",
+      "log level: info\n",
+      "process = 1\n",
+      "exit status = 2\n"
+     ]
+    }
+   ],
+   "source": [
+    "regex = re.compile(r'({date}\\s+{time})\\s+{level}\\s*:\\s*{msg}'.format(date=date, time=time, level=level, msg=msg))\n",
+    "match = regex.match(log_entry)\n",
+    "print(f'datetime = {match.group(1)}')\n",
+    "print(f'log level: {match.group(2)}')\n",
+    "print(f'process = {match.group(3)}')\n",
+    "print(f'exit status = {match.group(4)}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Although the final regular expression is still rather long, it is easier to read and to maintain.  Using `re.VERBOSE` with a triple-quoted string makes it more maintainable still, since the expression can be spread over several lines and commented."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "datetime = 2021-08-25 17:04:23.439405\n",
+      "log level: info\n",
+      "process = 1\n",
+      "exit status = 2\n"
+     ]
+    }
+   ],
+   "source": [
+    "regex = re.compile(r'''\n",
+    "    ({date}\\s+{time})\\s+        # date-time, up to microsecond precision\n",
+    "    {level}\\s*:\\s*              # log level of the log message\n",
+    "    {msg}                       # actual log message\n",
+    "    '''.format(date=date, time=time, level=level, msg=msg), re.VERBOSE)\n",
+    "match = regex.match(log_entry)\n",
+    "print(f'datetime = {match.group(1)}')\n",
+    "print(f'log level: {match.group(2)}')\n",
+    "print(f'process = {match.group(3)}')\n",
+    "print(f'exit status = {match.group(4)}')"
+   ]
   }
  ],
  "metadata": {
@@ -239,9 +437,9 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.5.1"
+   "version": "3.7.7"
   }
  },
  "nbformat": 4,
- "nbformat_minor": 0
+ "nbformat_minor": 4
 }