Rolf Schumacher
2017-04-07 10:20:45 UTC
I am using Saxon-HE-9.7.0-15.jar and I am writing a
transformation to anonymize its input.
As a first step I wanted to collect all distinct words in the input, and
I came across behavior that I do not understand.
I was not sure whether using the mode attribute on templates speeds
things up, and while testing that I got a result that puzzles me.
I boiled it down to these transformation rules:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="fn"
    exclude-result-prefixes="xs fn">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <xsl:variable name="allwords" as="xs:string+">
      <xsl:apply-templates select="*" mode="lookup"/>
    </xsl:variable>
    <xsl:variable name="words" select="distinct-values($allwords)"/>
    <root>
      <xsl:attribute name="allwords" select="count($allwords)"/>
      <xsl:attribute name="words" select="count($words)"/>
    </root>
  </xsl:template>
  <xsl:template match="*">
    <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" />
    <xsl:apply-templates select="*" />
  </xsl:template>
  <xsl:template match="*" mode="lookup">
    <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" />
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>
For one particular input (~30 MB) this produces the result:
<?xml version="1.0" encoding="UTF-8"?><root allwords="696831" words="7617"/>
However, when I comment out the second template (the match="*" template
without a mode), I get a different result from the very same input:
<?xml version="1.0" encoding="UTF-8"?><root allwords="531375" words="7620"/>
To make it very clear, here are the transformation rules that produce the
second result:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="fn"
    exclude-result-prefixes="xs fn">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <xsl:variable name="allwords" as="xs:string+">
      <xsl:apply-templates select="*" mode="lookup"/>
    </xsl:variable>
    <xsl:variable name="words" select="distinct-values($allwords)"/>
    <root>
      <xsl:attribute name="allwords" select="count($allwords)"/>
      <xsl:attribute name="words" select="count($words)"/>
    </root>
  </xsl:template>
  <!-- <xsl:template match="*"> -->
  <!--   <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" /> -->
  <!--   <xsl:apply-templates select="*" /> -->
  <!-- </xsl:template> -->
  <xsl:template match="*" mode="lookup">
    <xsl:value-of select="tokenize(text(),'[^A-Za-z0-9äöüßÄÖÜ]+')" />
    <xsl:apply-templates select="*" />
  </xsl:template>
</xsl:stylesheet>
Question: what is the semantic difference between the two sets of
transformation rules that could explain the difference in the results?
Kind Regards
Rolf