I need to convert an XML file into the IOB format.
The XML file represents the structure of a paper written in LaTeX, i.e. with sections and subsections. In this representation, sections are encoded as BODY; each one has a HEADER followed by paragraphs or subsections.
Example:
<DIV DEPTH="1">
<HEADER ID="H-8"> Practical Results </HEADER>
<P TYPE="TXT">
<S ID="S-56" TYPE="TXT"> To assess its performance , <REF REFID="R-12" ID="C-36">Grover et al. 1993</REF> tried various methods . </S>
<S ID="S-57" TYPE="TXT"> The grammar is defined in metagrammatical formalism which is compiled into a unification-based ` object grammar ' -- a syntactic variant of the Definite Clause Grammar formalism <REF REFID="R-21" ID="C-37">Pereira and Warren 1980</REF> -- containing 84 features and 782 phrase structure rules . </S>
<DIV DEPTH="2">
<HEADER ID="H-9"> Comparing the Parsers </HEADER>
<P TYPE="TXT">
<S ID="S-61" TYPE="TXT"> In the first experiment , the ANLT grammar was loaded and a set of sentences was input to each of the three parsers . </S>
</P>
<IMAGE ID="I-0"/>
</DIV>
What I want to do is to keep all the text, but convert it to a different format, i.e. I want to remove the BODY structure, and just tag the HEADERs and the text part like this:
Practical/B-Header Results/I-Header ./O
To/B-Text assess/I-Text its/I-Text performance/I-Text ,/I-Text Grover/I-Text et/I-Text al./I-Text tried/I-Text various/I-Text methods/I-Text ./O
The/B-Text grammar/I-Text ... ./O
And so on.
I know some DOM parsing in Java (I have been using jdom2 for a little while), but I do not know how to preserve the order of the text. For example, I want to fetch the content of the REF tag (which is nested inside S, as in the example), but the text of its parent S extends both before and after the REF tag.
Any pointers? This should be fairly simple, but searches like "strip XML tags after certain depth" did not help me :-(
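For reference, here is a minimal sketch of one direction that might work, using the JDK's built-in DOM API rather than jdom2. It relies on Node.getTextContent(), which concatenates the text of an element and all of its descendants in document order, so the REF text stays between the text before and after it. The class name IobSketch and the helpers toIob and tag are hypothetical names made up for this illustration. It also assumes that tokens are already separated by whitespace (as in the example), that only a sentence-final "." is tagged O, and that the input is well-formed XML. It does not append the extra "./O" after headers shown in the desired output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative, not from any library.
class IobSketch {

    /** Returns one IOB-tagged line per HEADER or S element, in document order. */
    static String toIob(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> lines = new ArrayList<>();
        walk(doc.getDocumentElement(), lines);
        return String.join("\n", lines);
    }

    /** Depth-first walk: structural elements are descended into, never printed. */
    private static void walk(Node node, List<String> lines) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip whitespace-only text between structural elements
        }
        String name = node.getNodeName();
        if (name.equals("HEADER") || name.equals("S")) {
            // getTextContent() flattens nested tags such as REF, keeping text order.
            String label = name.equals("HEADER") ? "Header" : "Text";
            String line = tag(node.getTextContent(), label);
            if (!line.isEmpty()) {
                lines.add(line);
            }
        } else {
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                walk(c, lines);
            }
        }
    }

    /** Tags the first token B-label, later tokens I-label, and a final "." as O. */
    static String tag(String text, String label) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String[] tokens = trimmed.split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            boolean finalPeriod = i == tokens.length - 1 && tokens[i].equals(".");
            String t = finalPeriod ? "O" : (i == 0 ? "B-" : "I-") + label;
            sb.append(tokens[i]).append('/').append(t);
        }
        return sb.toString();
    }
}
```

With jdom2 the same idea should carry over, since Element.getValue() likewise returns the text of all descendants in order. If the REF text needs different treatment (for example dropping the year), Element.getContent() returns the mixed text and element children in document order instead.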