Python 3.3.2 - Finding Image Sources In Html

Question

I need to locate and extract image sources from an HTML file.

Examples


Sample Text

Note the rather difficult edge cases in the first line: the onmouseover value contains both a decoy src="..." and a > character.

<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
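To see why the first line is awkward, it can help to run two simpler patterns against it. The sketch below is illustrative only: `naive_src` and `naive_tag` are hypothetical patterns, not part of the answer, chosen to show the two traps (a src inside another attribute's value, and a > inside a quoted value).

```python
import re

# First line of the sample text, with its decoy src and embedded '>'.
first_line = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">"""

# Hypothetical naive pattern 1: take the first src="..." found anywhere.
naive_src = re.compile(r'src="([^"]*)"')
naive_hit = naive_src.search(first_line).group(1)

# Hypothetical naive pattern 2: assume a tag ends at the first '>'.
naive_tag = re.compile(r'<img[^>]*>')
truncated_tag = naive_tag.search(first_line).group(0)

print(naive_hit)       # the decoy inside onmouseover, not the real source
print(truncated_tag)   # cut off at the '>' in "x > 3"
```

Both results are wrong for this input, which is why the pattern used in the answer steps over whole quoted attribute values rather than scanning character by character.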

Python Code

#!/usr/bin/env python3
import re

string = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
"""

regex = r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""

# For each match, print the whole tag, the quote character, and the src value.
intCount = 0
for matchObj in re.finditer(regex, string, re.M | re.I | re.S):
    print(" ")
    print("[", intCount, "][ 0 ] : ", matchObj.group(0))
    print("[", intCount, "][ 1 ] : ", matchObj.group(1))
    print("[", intCount, "][ 2 ] : ", matchObj.group(2))
    intCount += 1
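The pattern is dense, so one way to read it is to spell it out with `re.VERBOSE`. The following is a restatement of the same expression with comments, not a different approach; the names `IMG_SRC_VERBOSE`, `COMPACT` and `sample` are illustrative, and the comments are one reading of what each part does.

```python
import re

IMG_SRC_VERBOSE = re.compile(r"""
    <ima?ge?                  # tag name: img or image
    (?=\s|>)                  # the name must end here, so longer names are rejected
    (?=                       # lookahead: locate the real src without consuming input
        (?: [^>=]             #   any character other than the tag end or an equals sign
          | ='[^']*'          #   or a whole single-quoted attribute value
          | ="[^"]*"          #   or a whole double-quoted attribute value
          | =[^'"][^\s>]*     #   or a whole unquoted attribute value
        )*?
        \ssrc=                #   src preceded by whitespace, outside any quoted value
        (['"]?)               #   group 1: the opening quote, if there is one
        (.*?)                 #   group 2: the src value
        \1(?:\s|>)            #   the same quote again, then whitespace or the tag end
    )
    (?: [^>=] | ='[^']*' | ="[^"]*" | =[^'"][^\s>]* )*   # consume all attributes
    >                         # closing bracket of the tag
""", re.VERBOSE | re.IGNORECASE | re.DOTALL)

# The compact form from the answer, for comparison.
COMPACT = re.compile(
    r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>""",
    re.IGNORECASE | re.DOTALL,
)

sample = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<img src=http://another.example/NotQuoted.png>
"""

verbose_sources = [m.group(2) for m in IMG_SRC_VERBOSE.finditer(sample)]
compact_sources = [m.group(2) for m in COMPACT.finditer(sample)]
print(verbose_sources == compact_sources)
print(verbose_sources)
```

The lookahead does the searching while the final group consumes the tag, which is what lets group 0 hold the entire tag and group 2 hold only the source.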

Capture Groups

Group 0 gets the entire image or img tag.
Group 1 gets the quote that surrounded the src attribute value, if one exists.
Group 2 gets the src attribute value.

[ 0 ][ 0 ] :  <img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
[ 0 ][ 1 ] :  "
[ 0 ][ 2 ] :  http://another.example/picture.png

[ 1 ][ 0 ] :  <image class="logo" src="http://example.site/logo.jpg">
[ 1 ][ 1 ] :  "
[ 1 ][ 2 ] :  http://example.site/logo.jpg

[ 2 ][ 0 ] :  <img src="http://another.example/DoubleQuoted.png">
[ 2 ][ 1 ] :  "
[ 2 ][ 2 ] :  http://another.example/DoubleQuoted.png

[ 3 ][ 0 ] :  <image src='http://another.example/SingleQuoted.png'>
[ 3 ][ 1 ] :  '
[ 3 ][ 2 ] :  http://another.example/SingleQuoted.png

[ 4 ][ 0 ] :  <img src=http://another.example/NotQuoted.png>
[ 4 ][ 1 ] :  
[ 4 ][ 2 ] :  http://another.example/NotQuoted.png
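If only the source values are needed, group 2 can be collected directly into a list. This is a minimal sketch that assumes the HTML has already been read into a string; `find_image_sources` and `IMG_SRC_REGEX` are hypothetical names introduced here for illustration.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
IMG_SRC_REGEX = re.compile(
    r"""<ima?ge?(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\ssrc=(['"]?)(.*?)\1(?:\s|>))(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>""",
    re.IGNORECASE | re.DOTALL,
)


def find_image_sources(html):
    """Return the src value (capture group 2) of each img/image tag in html."""
    return [match.group(2) for match in IMG_SRC_REGEX.finditer(html)]


sample = """<img onmouseover=' src="NotTheDroidsYouAreLookingFor.png" ; if (x > 3) { funRotate(src); } ' src="http://another.example/picture.png">
<imagesomethingrandom class="logo" src="http://example.site/imagesomethingrandom.jpg">
<image class="logo" src="http://example.site/logo.jpg">
<img src="http://another.example/DoubleQuoted.png">
<image src='http://another.example/SingleQuoted.png'>
<img src=http://another.example/NotQuoted.png>
"""

sources = find_image_sources(sample)
for src in sources:
    print(src)
```

For a file on disk, the same function can be applied to the text returned by reading the file with `open(...)`, using whatever encoding the page was saved in.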


