Monday, July 31, 2017

Default parameterization of varchar to nvarchar in sql driver - Part 2

In the previous post I described the default parameterization of varchar to nvarchar in the JDBC driver and how to switch off that behavior. In this post I am going to explain in detail the checklist you should run through before setting sendStringParametersAsUnicode to false.
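For reference, jTDS also lets you switch the behavior off directly on the connection URL. A minimal sketch (server name, database name, and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;

public class UnicodeOffConnection {
    public static void main(String[] args) throws Exception {
        // jTDS accepts driver properties appended to the URL after the database name
        String url = "jdbc:jtds:sqlserver://dataserver_name/xyz;sendStringParametersAsUnicode=false";
        try (Connection connection = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("connected; string parameters will now be sent as non-Unicode");
        }
    }
}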

Checklist (for SQL Server)

  • Check the database level collation
SELECT name, collation_name
FROM sys.databases
WHERE name = 'xyz';  --where xyz is the database name

  • There should be no nvarchar (or nchar) columns in your database
SELECT * FROM xyz.INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ('nvarchar', 'nchar');
  • The collation set for the varchar columns in your database should be the same as the database-level collation (a Java sketch that automates these checks follows this list)
-- to view the table, column, data type and collation details
SELECT OBJECT_NAME(c.object_id) AS 'OBJECT NAME',
       SCHEMA_NAME(t.schema_id) AS 'SCHEMA NAME',
       t.name AS 'TABLE NAME',
       y.name AS 'DATA TYPE', c.*
FROM xyz.sys.columns c
JOIN xyz.sys.tables t ON c.object_id = t.object_id
JOIN sys.types y ON y.system_type_id = c.system_type_id
WHERE c.collation_name IS NOT NULL;

-- check if there are any columns with a collation different from that of the database
SELECT SCHEMA_NAME(t.schema_id) AS 'SCHEMA NAME',
       t.name AS 'TABLE NAME',
       y.name AS 'DATA TYPE', c.*
FROM xyz.sys.columns c
JOIN xyz.sys.tables t ON c.object_id = t.object_id
JOIN sys.types y ON y.system_type_id = c.system_type_id
WHERE c.collation_name NOT IN (SELECT collation_name
                               FROM sys.databases
                               WHERE name = 'xyz');
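If you want to automate this pre-check, something along the following lines could run at application startup. This is only a sketch: the JdbcTemplate is assumed to be wired against the xyz database, and the class and method names are my own invention.

import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class CollationPreCheck {

    private final JdbcTemplate jdbcTemplate;

    public CollationPreCheck(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    /** Fails fast if the database is not safe for sendStringParametersAsUnicode=false. */
    public void verify() {
        // 1) no nvarchar/nchar columns
        Integer nColumns = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ('nvarchar', 'nchar')",
                Integer.class);
        if (nColumns != null && nColumns > 0) {
            throw new IllegalStateException(nColumns + " nvarchar/nchar columns found");
        }

        // 2) every column collation matches the database collation
        String dbCollation = jdbcTemplate.queryForObject(
                "SELECT CONVERT(varchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Collation'))",
                String.class);
        List<String> mismatches = jdbcTemplate.queryForList(
                "SELECT name FROM sys.columns WHERE collation_name IS NOT NULL AND collation_name <> ?",
                String.class, dbCollation);
        if (!mismatches.isEmpty()) {
            throw new IllegalStateException("columns with a different collation: " + mismatches);
        }
    }
}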

 Why be concerned with collation?

This section is going to be long; it covers how we arrived at the checklist above.

What is collation?

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set.
Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation.
But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.
In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character as in German 'ö') and multiple-character mappings (such as the rule that 'ö' = 'OE' in one of the two German collations).
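The same idea exists in plain Java, which may make the upcoming examples easier to digest: java.text.Collator compares strings under a configurable strength, where PRIMARY roughly corresponds to case- and accent-insensitive and TERTIARY to case- and accent-sensitive. This is an analogy only, not how SQL Server implements collations.

import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);

        // PRIMARY: only base letters matter (roughly CI_AI)
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.equals("Aries", "aries"));      // true
        System.out.println(collator.equals("aries", "\u00E1ries")); // true ('áries')

        // TERTIARY: case and accents both matter (roughly CS_AS)
        collator.setStrength(Collator.TERTIARY);
        System.out.println(collator.equals("Aries", "aries"));      // false
        System.out.println(collator.equals("aries", "\u00E1ries")); // false
    }
}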

A collation can be accent-sensitive (AS) or accent-insensitive (AI), and case-sensitive (CS) or case-insensitive (CI).

The following should give you a pretty good idea about this.


create table xyz..csas (name varchar(10) COLLATE Latin1_General_CS_AS)
insert into xyz..csas values ('Aries')
insert into xyz..csas values ('aries')
insert into xyz..csas values (N'áries')
select * from xyz..csas where name = 'Aries' --Aries
select * from xyz..csas where name = 'aries' --aries
select * from xyz..csas where name = N'áries' --áries

create table xyz..csai (name varchar(10) COLLATE Latin1_General_CS_AI)
insert into xyz..csai values ('Aries')
insert into xyz..csai values ('aries')
insert into xyz..csai values (N'áries')
select * from xyz..csai where name = 'Aries'  --Aries
select * from xyz..csai where name = 'aries' --aries, áries
select * from xyz..csai where name = N'áries' --aries, áries

create table xyz..ciai (name varchar(10) COLLATE Latin1_General_CI_AI)
insert into xyz..ciai values ('Aries')
insert into xyz..ciai values ('aries')
insert into xyz..ciai values (N'áries')
select * from xyz..ciai where name = 'Aries' --Aries,aries,áries
select * from xyz..ciai where name = 'aries' --Aries,aries,áries
select * from xyz..ciai where name = N'áries' --Aries,aries,áries

create table xyz..cias (name varchar(10) COLLATE Latin1_General_CI_AS)
insert into xyz..cias values ('Aries')
insert into xyz..cias values ('aries')
insert into xyz..cias values (N'áries')
select * from xyz..cias where name = 'Aries' --Aries,aries
select * from xyz..cias where name = 'aries' --Aries,aries
select * from xyz..cias where name = N'áries' --áries


Let me walk through the steps by which I arrived at the checklist.


I wrote test cases in our application that assert the queries above, with and without sendStringParametersAsUnicode set.

I expected them to succeed for dataSourceWithUniCodeEnabled and fail for dataSourceWithUniCodeDisabled. But no, they succeeded for both.

So when I passed 'á' from the code with Unicode disabled, how did the string still match, as if it had been sent as Unicode?

This is where the database-level collation matters. It was working fine because the collation at the database level was also Latin1_General_CS_AS.

Confusing, right? Let us take a step back.

I create a table with Latin1_General_CS_AS collation and a varchar column. By definition, I should not be able to insert Unicode. But why does the following work?

create table xyz..egvarchar (name varchar(10) COLLATE Latin1_General_CS_AS)
insert into  xyz..egvarchar  values (N'áries')
select * from xyz..egvarchar where name = N'áries' -- gives áries as output
This is because the collation determines which characters are representable even in a varchar column (each collation maps the 256 single-byte values to a code page). To see which characters a collation supports:

;WITH AllNumbers AS
(
    SELECT 1 AS Number
    UNION ALL
    SELECT Number + 1
    FROM AllNumbers
    WHERE Number + 1 < 256
)
SELECT Number AS ASCII_Value, CHAR(Number) COLLATE Latin1_General_CS_AI AS ASCII_Char
FROM AllNumbers
OPTION (MAXRECURSION 256);


So if I try to insert a non-Latin character into the same table
insert into xyz..egvarchar values (N'áяies')
it gets inserted, but when I select I get:
áries
á?ies --> note the question mark: 'я' has no mapping in this collation's code page, so the varchar column cannot store it.
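You can reproduce this substitution in plain Java: windows-1252 is the code page behind the Latin1_General collations, and encoding a string containing an unmappable character yields the same '?'. A small sketch:

import java.nio.charset.Charset;

public class CodePageDemo {
    public static void main(String[] args) {
        // windows-1252 is the code page used by the Latin1_General collations
        Charset cp1252 = Charset.forName("windows-1252");
        String value = "\u00E1\u044Fies"; // áяies: 'я' is Cyrillic, outside this code page
        byte[] bytes = value.getBytes(cp1252); // the unmappable 'я' is replaced with '?'
        System.out.println(new String(bytes, cp1252)); // prints á?ies
    }
}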
If I repeat the same exercise with the column as nvarchar, the output would be:
create table xyz..egnvarchar (name nvarchar(10) COLLATE Latin1_General_CS_AS)
insert into  xyz..egnvarchar  values (N'áries')
insert into  xyz..egnvarchar  values (N'áяies')
select * from xyz..egnvarchar
OUTPUT:
áries
áяies
This is why it is important to make sure there are no nvarchar columns before turning Unicode off. Now back to our problem: the reason my tests did not fail with sendStringParametersAsUnicode set to false was that 'á' was supported by the collation. Which collation? The collation at the level at which you have defined your datasource.
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/xyz" />
In the above example, it is the collation of the database xyz. This is very important; hence checklist item 3: the collation of the columns in your database should be the same as the database-level collation. To validate this again, we created another database, abc, with collation Cyrillic_General_CI_AS, and created two tables in it: one with Cyrillic_General_CI_AS collation and another with Latin1_General_CI_AS.
create table abc..cyrillic (name varchar(10) COLLATE Cyrillic_General_CI_AS)
insert into abc..cyrillic values (N'Aяies')
select * from abc..cyrillic where name = N'Aяies'
create table abc..latin (name varchar(10) COLLATE Latin1_General_CI_AS)
insert into abc..latin values (N'áries')
select * from abc..latin where name = N'áries'
The table with collation set to Cyrillic returns results irrespective of the sendStringParametersAsUnicode property. The table with collation set to Latin returns results only when sendStringParametersAsUnicode is set to true. In other words, whether the SQL driver treats a parameter as Unicode or not effectively depends on the collation of the database your datasource points to. Note that the level at which the url property is set is very important: it should be set at the database level, since you ran the pre-check against your database.
For example:
jdbc:jtds:sqlserver://dataserver_name/xyz -> right
jdbc:jtds:sqlserver://dataserver_name -> wrong (the collation of the dataserver and the database might differ)
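A cheap safeguard against the wrong-URL mistake is to assert at startup which database the connection actually lands on. A sketch (the class and method names are mine):

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class DataSourceSanityCheck {

    /** Fails fast if the datasource URL points at the server instead of the intended database. */
    public static void assertDatabase(DataSource dataSource, String expectedDatabase) {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
        String actual = jdbcTemplate.queryForObject("SELECT DB_NAME()", String.class);
        if (!expectedDatabase.equalsIgnoreCase(actual)) {
            throw new IllegalStateException(
                    "expected database " + expectedDatabase + " but connected to " + actual);
        }
    }
}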

 Code

<bean id="cyrillicDataSourceWithUniCodeDisabled" class="org.apache.tomcat.jdbc.pool.DataSource">
<property name="driverClassName" value="net.sourceforge.jtds.jdbc.Driver" />
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/abc" />
<property name="initialSize" value="2" />
<property name="maxActive"
value="<some_value>" />
<property name="maxWait"
value="<some_value>" />
<property name="removeAbandonedTimeout"
value="<some_value>" />
        <property name="minEvictableIdleTimeMillis" value="<some_value>"/>
        <property name="timeBetweenEvictionRunsMillis" value="<some_value>"/>
<property name="testOnBorrow" value="true" />
<property name="validationQuery">
<value>SELECT 1</value>
</property>
<property name="connectionProperties" value="sendStringParametersAsUnicode=false;"/>
</bean>
<bean id="cyrillicDataSourceWithUniCodeEnabled" class="org.apache.tomcat.jdbc.pool.DataSource">
<property name="driverClassName" value="net.sourceforge.jtds.jdbc.Driver" />
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/abc" />
<property name="initialSize" value="2" />
<property name="maxActive"
value="<some_value>" />
<property name="maxWait"
value="<some_value>" />
<property name="removeAbandonedTimeout"
value="<some_value>" />
        <property name="minEvictableIdleTimeMillis" value="<some_value>"/>
        <property name="timeBetweenEvictionRunsMillis" value="<some_value>"/>
<property name="testOnBorrow" value="true" />
<property name="validationQuery">
<value>SELECT 1</value>
</property
>
</bean>


If you change the url of the injected datasource from
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/abc" />
    to
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/" />
the dataserver's collation will be used instead, which is Latin; only the test with Unicode enabled will succeed here.



import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;

import java.util.List;

import javax.sql.DataSource;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowMapper;

public class CyrillicDatasourceUnicodeTest extends BaseDAOTest {

    private static final String CYRILLIC_ACCENT_CASE = "A" + "\u044F" + "ies"; // Aяies

    private static final String LATIN_ACCENT_CASE = "\u00E1" + "ries"; // áries

    @Autowired
    private DataSource cyrillicDataSourceWithUniCodeEnabled;

    @Autowired
    private DataSource cyrillicDataSourceWithUniCodeDisabled;

    // maps the single varchar column of each test table to a String
    private final RowMapper<String> rowMapper = (rs, rowNum) -> rs.getString(1);

    @org.junit.Test
    public void testForCyrillicForUnicodeEnabled() {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(cyrillicDataSourceWithUniCodeEnabled);
        String sql = "select name from abc..cyrillic where name = ?";

        List<String> results = jdbcTemplate.query(sql, rowMapper, CYRILLIC_ACCENT_CASE);
        assertNotNull(results);
        System.out.println(results);
        assertEquals(1, results.size());
        assertEquals(CYRILLIC_ACCENT_CASE, results.get(0));
    }

    @org.junit.Test
    public void testForCyrillicForUnicodeDisabled() {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(cyrillicDataSourceWithUniCodeDisabled);
        String sql = "select name from abc..cyrillic where name = ?";

        List<String> results = jdbcTemplate.query(sql, rowMapper, CYRILLIC_ACCENT_CASE);
        assertNotNull(results);
        System.out.println(results);
        assertEquals("supposed to match regardless of the sendStringParametersAsUnicode "
                        + "property because the database has the same collation as the column",
                1, results.size());
        assertEquals(CYRILLIC_ACCENT_CASE, results.get(0));
    }

    @org.junit.Test
    public void testForLatinForUnicodeEnabled() {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(cyrillicDataSourceWithUniCodeEnabled);
        String sql = "select name from abc..latin where name = ?";

        List<String> results = jdbcTemplate.query(sql, rowMapper, LATIN_ACCENT_CASE);
        assertNotNull(results);
        System.out.println(results);
        assertEquals(1, results.size());
        assertEquals(LATIN_ACCENT_CASE, results.get(0));
    }

    @org.junit.Test
    public void testForLatinForUnicodeDisabled() {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(cyrillicDataSourceWithUniCodeDisabled);
        String sql = "select name from abc..latin where name = ?";

        List<String> results = jdbcTemplate.query(sql, rowMapper, LATIN_ACCENT_CASE);
        assertNotNull(results);
        System.out.println(results);
        assertEquals("no results expected because Unicode is disabled and the collation of "
                        + "the column (Latin) differs from that of the database (Cyrillic)",
                0, results.size());
    }

}

Thursday, July 13, 2017

ReflectionTestUtils to tweak private fields for unit tests

I was writing an integration test for a create-in-bulk mode, which inserts data in bulk if the number of records is over 100.

Now, just to test this bulk insert feature, I do not want to load 100 records. So I need to tweak the batchSize from 100 down to 1.

// The DAO under test holds the threshold as a private field:
public class DaoImpl implements Dao {
    private static Integer batchSize = 100;
    // ...
}

// The test tweaks that field via reflection:
public class DaoImplTest {

    @Autowired
    private Dao dao;

    @org.junit.Test
    public void testBulkInsert() {
        // get the default size
        Integer defaultBatchSize = (Integer) ReflectionTestUtils.getField(dao, "batchSize");

        // tweak the batch size so a single record triggers the bulk path
        Integer batchSize = 1;
        ReflectionTestUtils.setField(dao, "batchSize", batchSize, Integer.class);
        assertEquals(batchSize, ReflectionTestUtils.getField(dao, "batchSize"));

        // proceed with test

        // revert it back to the default size
        ReflectionTestUtils.setField(dao, "batchSize", defaultBatchSize, Integer.class);
    }
}

ReflectionTestUtils.setField will internally take care of making your field accessible.
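Under the hood this is ordinary reflection. A simplified sketch of the equivalent (not the actual Spring source, which also walks up the class hierarchy):

import java.lang.reflect.Field;

public final class FieldTweaker {

    /** Simplified equivalent of ReflectionTestUtils.setField for a field declared on the target's class. */
    static void setField(Object target, String name, Object value) throws Exception {
        Field field = target.getClass().getDeclaredField(name);
        field.setAccessible(true); // this is the step ReflectionTestUtils handles for you
        field.set(target, value);
    }
}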


Use of TransactionProxyFactoryBean in Spring

Generally, while writing integration tests, we annotate with @TransactionConfiguration and set defaultRollback to true. The rollback helps when you do not want to explicitly delete the test data after every test case.

In a project I was working on, I happened to write an integration test for bulk insert. Bulk inserts into the table are done by dumping the records into a stage table. There is a commons library in the firm that is used to create this stage table.

I observed that the rollback was not happening, but only for the bulk insert test cases. At first I thought: we are creating a stage table, which means a CREATE statement is executed, and since DDL statements are implicitly committed we will not be able to roll back. But later I found that in SQL Server, DDL statements are not implicitly committed; they can participate in a transaction.

So when I actually went through the stageUtils source code, I found the cause was a connection.commit() it was invoking; it was not the DDL statement that was blocking the rollback.
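To make the failure mode concrete, here is a hypothetical sketch of the kind of code inside StageUtils (we do not own the real source; the class and method names here are invented):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class StageUtilsSketch {

    // Hypothetical: the explicit commit, not the DDL, is what defeats the test's rollback.
    public void createStageTable(Connection connection, String createTableDdl) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            statement.execute(createTableDdl);
        }
        connection.commit(); // commits everything done so far in the surrounding transaction
    }
}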

How to fix it?

1) stageUtils is not something the project owns, hence there is no way I can modify it.
2) For my transaction to roll back, I need to treat this stage table creation as a new transaction. The way to achieve that is by annotating stageUtils with @Transactional(propagation = Propagation.REQUIRES_NEW).

But note that I cannot do 2) because of 1).
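For completeness, if we did own the code, 2) would look like the following (hypothetical, since StageUtils cannot be modified):

import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

public class StageUtils {

    // Hypothetical annotation we cannot actually add to the shared library
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void createStageTable(String tableName) {
        // the stage table is created and committed in its own transaction,
        // leaving the caller's transaction free to roll back
    }
}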

Here is where TransactionProxyFactoryBean comes to the rescue. It enables creating a proxy around stageUtils and defining the required transaction attributes.

<bean id="stageUtils" class="com.foo.infra.util.StageUtils"
    scope="prototype"/>
<bean id="stageUtilsProxy"
    class="org.springframework.transaction.interceptor.TransactionProxyFactoryBean">
    <property name="transactionManager" ref="someTransactionManager" />
    <property name="target" ref="stageUtils" />
    <property name="transactionAttributes">
        <props>
            <prop key="*">PROPAGATION_REQUIRES_NEW</prop>
        </props>
    </property>
</bean>
<bean id="myDAO"
    class="com.foo.myproject.dao.someDAOImpl">
        <!-- BEFORE -->
        <!-- <property name="stageUtils" ref="stageUtils" /> -->
        <!-- AFTER -->
        <property name="stageUtils" ref="stageUtilsProxy" />
</bean>


@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration
@TransactionConfiguration(transactionManager = "someTransactionManager", defaultRollback = true)
@Transactional
public abstract class BaseIntegrationDAOTest

PROPAGATION_REQUIRES_NEW forces stageUtils to start a transaction of its own, so its commit no longer disturbs your test's rollback.