There are loads of these all over the place, but here are some useful preg_replace examples for text and HTML processing that were either hard to find or that I ended up writing myself. Use, praise, embellish or flame as you see fit.

Remove repeated words (case insensitive)

$text = preg_replace('/\b(\w+)(\s+\1\b)+/i', '$1', $text);

‘Keep your your head’ becomes ‘Keep your head’

Remove repeated punctuation

$text = preg_replace('/\.+/', '.', $text);

‘Keep your head...’ becomes ‘Keep your head.’ Don’t forget to escape regex characters such as the full stop.
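The same idea stretches to other punctuation if you use a backreference instead of one pattern per character. A minimal sketch, assuming you only care about the marks listed in the made-up $marks string; preg_quote() does the escaping for you:

```php
<?php
// Hypothetical list of punctuation marks to de-duplicate.
$marks = '.!?,;:';

// preg_quote() escapes anything that is special in a regex, so the
// list can be dropped straight into a character class.
$pattern = '/([' . preg_quote($marks, '/') . '])\1+/';

$text = 'Keep your head... Really??!! Yes,, really.';

// \1+ matches one or more repeats of whichever mark was captured.
$text = preg_replace($pattern, '$1', $text);

echo $text, "\n"; // Keep your head. Really?! Yes, really.
```

Only runs of the same mark are collapsed, so a mixed run such as ‘?!’ is left alone.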

Clean up a sentence end that has no trailing space

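$text = preg_replace('/\.(?=\S)/', '. ', $text);

‘Keep your head.Don’t fall apart’ becomes ‘Keep your head. Don’t fall apart’. This uses lookahead, so the character after the full stop is checked but not consumed. It will also split decimals such as 3.14, so keep it for plain prose.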
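If the text contains numbers, the lookahead can be narrowed. A rough sketch, assuming a new sentence always starts with an unaccented capital letter (which will not hold for every text, abbreviations like U.S.A being the obvious casualty):

```php
<?php
$text = "Pi is 3.14.Keep your head.Don't fall apart.";

// Only add a space when the full stop is directly followed by A-Z,
// so the decimal point and the final full stop are left untouched.
$text = preg_replace('/\.(?=[A-Z])/', '. ', $text);

echo $text, "\n"; // Pi is 3.14. Keep your head. Don't fall apart.
```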

Remove carriage returns, line feeds and tabs

$text = str_replace(array("\r\n", "\r", "\n", "\t"), '', $text);

An oldie but a goodie.
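One thing to watch: removing line breaks outright glues the words either side together. A small sketch of an alternative, assuming you would rather collapse each run of whitespace into a single space:

```php
<?php
$text = "Keep your\r\nhead\tup";

// Removing the characters outright joins neighbouring words.
$removed = str_replace(array("\r\n", "\r", "\n", "\t"), '', $text);

// Collapsing every run of whitespace to one space keeps them apart.
$collapsed = trim(preg_replace('/\s+/', ' ', $text));

echo $removed, "\n";   // Keep yourheadup
echo $collapsed, "\n"; // Keep your head up
```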

Get all image URLs from an HTML document

$images = array();
// Capture the value of every src="..." or src='...' attribute
preg_match_all('/\bsrc\s*=\s*(["\'])(.*?)\1/i', $data, $media);
foreach ($media[2] as $url)
{
	// Ignore any query string when checking the file extension
	$path = (string) parse_url($url, PHP_URL_PATH);
	$extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
	if (in_array($extension, array('jpg', 'jpeg', 'gif', 'png')))
	{
		$images[] = $url;
	}
}

Puts all the image URLs in the $images array.
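Regexes and HTML are uneasy bedfellows, so here is a different technique for comparison: letting PHP's DOM extension parse the markup. This is only a sketch, assuming the DOM extension is installed and that you want every img src regardless of file extension; the sample markup is made up.

```php
<?php
// Hypothetical sample markup standing in for $data.
$data = '<html><body>'
      . '<img src="/images/logo.PNG" alt="Logo">'
      . "<img src='photo.jpg?size=large'>"
      . '<script src="app.js"></script>'
      . '</body></html>';

$images = array();

$doc = new DOMDocument();
// Collect parser warnings internally; real pages are rarely valid.
libxml_use_internal_errors(true);
$doc->loadHTML($data);
libxml_clear_errors();

// Only <img> elements are visited, so the script src is ignored.
foreach ($doc->getElementsByTagName('img') as $img)
{
	$url = $img->getAttribute('src');
	if ($url !== '')
	{
		$images[] = $url;
	}
}

print_r($images);
```

The trade-off is a heavier parse in exchange for not having to second-guess quoting and attribute order.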

Strip non printable characters

$text = preg_replace("/[^[:print:]]+/", "", $text);

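Does what it says on the tin. Bear in mind that [:print:] works on bytes, so in the default C locale it strips accented and other non-ASCII characters too.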
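For UTF-8 text, a Unicode-aware variant may suit better. A minimal sketch, assuming the input is valid UTF-8 (with the u modifier preg_replace returns NULL if it is not) and that tabs and line breaks should go as well, since \p{C} covers all control characters:

```php
<?php
// Sample text: "café" followed by a NUL byte, a bell and a tab.
$text = "caf\xC3\xA9\x00 au\x07 lait\t";

// Byte-oriented POSIX class: what it keeps depends on the locale.
$ascii = preg_replace('/[^[:print:]]+/', '', $text);

// \p{C} matches control, format, unassigned and private-use
// characters, so ordinary letters such as é survive.
$unicode = preg_replace('/\p{C}+/u', '', $text);

echo $unicode, "\n"; // café au lait
```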

Remove HTML tags

$text = preg_replace
	(
	array(
	// Remove invisible content
	'@<head[^>]*?>.*?</head>@siu',
	'@<style[^>]*?>.*?</style>@siu',
	'@<script[^>]*?>.*?</script>@siu',
	'@<object[^>]*?>.*?</object>@siu',
	'@<embed[^>]*?>.*?</embed>@siu',
	'@<applet[^>]*?>.*?</applet>@siu',
	'@<noframes[^>]*?>.*?</noframes>@siu',
	'@<noscript[^>]*?>.*?</noscript>@siu',
	'@<noembed[^>]*?>.*?</noembed>@siu',
	// Add line breaks before & after blocks
	'@<((br)|(hr))@iu',
	'@</?((address)|(blockquote)|(center)|(del))@iu',
	'@</?((div)|(h[1-6])|(ins)|(isindex)|(p)|(pre))@iu',
	'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
	'@</?((table)|(th)|(td)|(caption))@iu',
	'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
	'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
	'@</?((frameset)|(frame)|(iframe))@iu',),
	array(
	' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
	"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
	"\n\$0", "\n\$0",),$text
	);
// Remove all remaining tags and comments and return.
$text = strip_tags($text);

OK, so strip_tags sort of does this on its own, but it only removes the tags themselves and leaves the contents of script, style and similar elements behind in the text.
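A quick illustration of the difference, using a made-up fragment and just one script pattern rather than the whole list:

```php
<?php
$html = '<p>Keep your head</p><script>alert("boo");</script>';

// strip_tags() alone drops the tags but keeps the script body.
$naive = strip_tags($html);

// Removing the script element first gets rid of its contents too.
$clean = preg_replace('@<script[^>]*?>.*?</script>@si', ' ', $html);
$clean = trim(strip_tags($clean));

echo $naive, "\n"; // Keep your headalert("boo");
echo $clean, "\n"; // Keep your head
```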

PHP performance tips

April 22, 2009

It’s Sunday morning, the weather’s terrific and I’m sitting at the desk in the office at home finishing off some project work that should have been done on Friday. Still, it’ll be done in a jiffy and I can take the dog out to the meadow in St Ives and reflect on why I don’t work at home every day rather than waste 4 hours commuting. It’s more productive, it doesn’t seem like work and I just get more time.

Anyway, I can’t talk too much about the project itself, other than it’s a website developed using PHP which involves extensive natural language parsing and processing. It needed to be performant and scalable so, although I’ve used PHP many times before, I thought it would be useful to revisit PHP performance.

Consequently, I built a small performance testing framework so I could quickly evaluate which PHP methods yielded the best results. Over the course of the development I’ve compiled the following list, which I thought I’d share.
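The framework itself isn't shown here, but the general idea can be sketched in a few lines. This is an illustration only, not the code used on the project: time_it() and the two test functions are hypothetical names, and timings from a micro-benchmark like this vary between PHP versions and machines.

```php
<?php
/**
 * Hypothetical helper: runs $callback $iterations times and returns
 * the elapsed wall-clock time in seconds.
 */
function time_it($callback, $iterations = 100000)
{
	$start = microtime(true);
	for ($i = 0; $i < $iterations; $i++)
	{
		call_user_func($callback);
	}
	return microtime(true) - $start;
}

// Two candidate approaches to compare.
function with_single_quotes() { return 'Keep your head'; }
function with_double_quotes() { return "Keep your head"; }

$results = array(
	'single quotes' => time_it('with_single_quotes'),
	'double quotes' => time_it('with_double_quotes'),
);

foreach ($results as $label => $seconds)
{
	printf("%-15s %.4f s\n", $label, $seconds);
}
```

Run each comparison several times and look at the spread before drawing conclusions from a single pass.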

  • Use single quotes over double quotes.
  • Use switch over lots of if statements.
  • Avoid calling a function in the loop condition on every iteration, e.g. for($i=0;$i<count($x);$i++){… Work the value out once before the loop instead.
  • Use foreach for looping collections/arrays.
  • PHP4 passes objects by value (copying them); PHP5 passes object handles, so objects behave as if passed by reference.
  • Consider using the Singleton pattern when creating complex PHP classes.
  • Use POST over GET for all values that will wind up in the database for TCP/IP packet performance reasons.
  • Use ctype_alnum, ctype_alpha and ctype_digit over regular expressions to test form value types for performance reasons.
  • Use full file paths in production environments rather than relying on basename/file_exists/open_basedir, to avoid the performance hit of the filesystem having to hunt through the file path. Once determined, serialize and/or cache path values in a $_SETTINGS array, e.g. $_SETTINGS['cwd'] = getcwd();
  • Use require/include over require_once/include_once to ensure proper opcode caching.
  • Use tmpfile or tempnam for creating temp files/filenames
  • Use a proxy to access web services (XML or JSON) on foreign domains using XMLHTTP to avoid cross-domain errors, e.g. wibble.com <-> XMLHTTP <-> wobble.com
  • Use error_reporting(E_ALL); while debugging.
  • Set Apache AllowOverride to “None” so that Apache doesn’t look for .htaccess files in every directory it accesses.
  • Use a fast file server (e.g. thttpd) for serving static content from its own hostname, e.g. static.mydomain.com alongside dynamic.mydomain.com
  • Serialize application settings like paths into an associative array and cache or serialize that array after first execution.
  • Use PHP output control buffering for page caching of heavily accessed pages.
  • Use PDO prepare over native db prepare for statements, e.g. PDO::MYSQL_ATTR_DIRECT_QUERY => 1
  • Do NOT use SQL wildcard select. eg. SELECT *
  • Avoid using SQL directive DISTINCT
  • Use database logic (queries, joins, views, procedures) over loopy PHP.
  • Use the multi-row shortcut syntax for SQL inserts if not using PDO parameters, e.g. insert into sometable (field1,field2) values ('x','y'),('p','q');
  • Use Zend – it’s the best PHP library around
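A few of the tips above are easier to see in code. This is a rough sketch with made-up sample values, not code from the project; it covers the loop condition, ctype and output buffering tips:

```php
<?php
$x = array('alpha', 'beta', 'gamma');

// Loop tip: call count() once, not on every iteration.
$seen = array();
for ($i = 0, $n = count($x); $i < $n; $i++)
{
	$seen[] = $x[$i];
}

// ctype tip: validate simple form values without a regex.
// The ctype functions expect strings, hence the casts.
$age  = '42';
$user = 'bob_99';
$ageIsDigits = ctype_digit((string) $age);  // true
$userIsAlnum = ctype_alnum((string) $user); // false, underscore

// Output buffering tip: capture the rendered page so it can be
// stored in whatever cache you use and replayed on later requests.
ob_start();
echo '<h1>Heavily accessed page</h1>';
$page = ob_get_clean();

echo strlen($page), " bytes captured\n";
```

Where the captured page is stored (a file, shared memory, a cache server) is left open here, since the list doesn't say.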

Comments on a postcard please.