batch parenthesis matching

Thu Aug 6 21:02:30 IDT 2009

Erez D <erez0001 at gmail.com> writes:

> hi
> i have an html file with few different instances of:
> <span class="myclass">
> ... some html, e.g. <B> blah blah <a href=....> </a> </b>
> </span>
> i want to remove these instances.
> ( the html inside the <span> varies between instances, and there is a non
> constant number of instances)
> i thought of replacing '<[^/]' (i.e. '<' followed by something other than '/' )
> with '{' and '</' with '}' and then doing parenthesis matching
> however i need it done automatically in batch. (i can do parenthesis matching
> in vi. can i do this in sed ?)

Sed is line-oriented, which will make this a bit difficult.

If I understand you correctly, and you want to remove everything
between "<span" and "span>" including the span tags themselves, *and*
the file does not contain the span tags in comments or string literals
or anything like that, *and* "<span" always has a matching "span>",
then one way to do it would be

$ awk 'BEGIN {RS="(<span|span>)"} NR%2==1' <filename>

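which will treat either "<span" or "span>" as a record separator
and will print only the odd-numbered records (everything between
"<span" and "span>" lands in an even-numbered record and is skipped).
Note that a regexp RS is an extension supported by gawk and mawk;
POSIX leaves the behaviour of a multi-character RS unspecified.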
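
To see the record splitting on a small input, here is a minimal sketch.
The function name strip_spans and the sample line are made up for
illustration, and it assumes an awk that accepts a regexp RS.

```shell
# Hypothetical wrapper around the one-liner above.
# Assumes an awk with regexp RS support (e.g. gawk or mawk).
strip_spans() {
    awk 'BEGIN {RS="(<span|span>)"} NR%2==1' "$@"
}

# Record 1 is "keep1 ", record 2 is the span body (skipped),
# record 3 is " keep2" plus the trailing newline.
printf 'keep1 <span class="myclass">drop <b>me</b></span> keep2\n' | strip_spans
```

Each surviving record is printed followed by a newline, so the two kept
fragments come out on separate lines, with the span tags and everything
between them gone.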

All you need to know about awk is that it splits the input into
records, RS is the record separator (set to a regexp in the BEGIN
block), and NR is the number of the current record. A condition with
no action prints the record, so NR%2==1 prints the odd-numbered
records.
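
The one-liner pairs separators purely by parity, so it relies on spans
not being nested inside one another. If they might be, the depth
counting hinted at in the question can be sketched in plain awk. This is
only a sketch under the same assumptions as above (balanced tags, no
"<span" in comments or strings); the function name strip_nested_spans is
made up, and the whole input is read into memory.

```shell
# Hypothetical helper: drop every <span ... span> region, tracking
# nesting depth instead of record parity. Uses only match()/substr(),
# so it does not need a regexp RS.
strip_nested_spans() {
    awk '
    { buf = buf $0 "\n" }                  # slurp the whole input
    END {
        depth = 0; out = ""
        while (match(buf, /<span|span>/)) {
            # keep the text before the token only when outside any span
            if (depth == 0) out = out substr(buf, 1, RSTART - 1)
            tok = substr(buf, RSTART, RLENGTH)
            if (tok == "<span") depth++
            else if (depth > 0) depth--
            buf = substr(buf, RSTART + RLENGTH)
        }
        if (depth == 0) out = out buf      # tail after the last span
        printf "%s", out
    }' "$@"
}

# The inner span no longer confuses the pairing; this prints "a  b".
printf 'a <span class="myclass">x <span>y</span> z</span> b\n' | strip_nested_spans
```

Like the one-liner, this removes every span regardless of its class, so
it only fits files where all spans are of the unwanted kind.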

Does this do what you want?

-- 
Oleg Goldshmidt | pub at goldshmidt.org