Monday, July 31, 2017

Default parameterization of varchar to nvarchar in sql driver - Part 2

In the previous post I described the default parameterization of varchar to nvarchar in the JDBC driver and how to switch off that behavior. In this post I am going to explain in detail the checklist you should run through before setting sendStringParametersAsUnicode to false.
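For reference, jTDS also lets you switch the behavior off directly on the connection URL. A minimal sketch (server name, database name, and credentials are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;

public class UnicodeOffConnection {
    public static void main(String[] args) throws Exception {
        // jTDS accepts driver properties appended to the URL after the database name
        String url = "jdbc:jtds:sqlserver://dataserver_name/xyz;sendStringParametersAsUnicode=false";
        try (Connection connection = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("connected; string parameters will now be sent as non-Unicode");
        }
    }
}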

Checklist (for SQL Server)

  • Check the database level collation
SELECT name, collation_name
FROM sys.databases
WHERE name = 'xyz';  --where xyz is the database name

  • There should be no nvarchar (or nchar) columns in your database
SELECT * FROM xyz.INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ('nvarchar', 'nchar');
  • The collation set for the varchar columns in your database should be the same as the database-level collation (a Java sketch that automates these checks follows this list)
-- to view the table, column, data type and collation details
SELECT OBJECT_NAME(c.object_id) AS 'OBJECT NAME',
       SCHEMA_NAME(t.schema_id) AS 'SCHEMA NAME',
       t.name AS 'TABLE NAME',
       y.name AS 'DATA TYPE', c.*
FROM xyz.sys.columns c
JOIN xyz.sys.tables t ON c.object_id = t.object_id
JOIN sys.types y ON y.system_type_id = c.system_type_id
WHERE c.collation_name IS NOT NULL;

-- check if there are any columns with a collation different from that of the database
SELECT SCHEMA_NAME(t.schema_id) AS 'SCHEMA NAME',
       t.name AS 'TABLE NAME',
       y.name AS 'DATA TYPE', c.*
FROM xyz.sys.columns c
JOIN xyz.sys.tables t ON c.object_id = t.object_id
JOIN sys.types y ON y.system_type_id = c.system_type_id
WHERE c.collation_name NOT IN (SELECT collation_name
                               FROM sys.databases
                               WHERE name = 'xyz');
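If you want to automate this pre-check, something along the following lines could run at application startup. This is only a sketch: the JdbcTemplate is assumed to be wired against the xyz database, and the class and method names are my own invention.

import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class CollationPreCheck {

    private final JdbcTemplate jdbcTemplate;

    public CollationPreCheck(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    /** Fails fast if the database is not safe for sendStringParametersAsUnicode=false. */
    public void verify() {
        // 1) no nvarchar/nchar columns
        Integer nColumns = jdbcTemplate.queryForObject(
                "SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLUMNS WHERE DATA_TYPE IN ('nvarchar', 'nchar')",
                Integer.class);
        if (nColumns != null && nColumns > 0) {
            throw new IllegalStateException(nColumns + " nvarchar/nchar columns found");
        }

        // 2) every column collation matches the database collation
        String dbCollation = jdbcTemplate.queryForObject(
                "SELECT CONVERT(varchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Collation'))",
                String.class);
        List<String> mismatches = jdbcTemplate.queryForList(
                "SELECT name FROM sys.columns WHERE collation_name IS NOT NULL AND collation_name <> ?",
                String.class, dbCollation);
        if (!mismatches.isEmpty()) {
            throw new IllegalStateException("columns with a different collation: " + mismatches);
        }
    }
}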

 Why be concerned with collation?

This section is going to be long; it covers how we arrived at the checklist above.

What is collation?

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.
Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set.
Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation.
But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.
In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character as in German 'ö') and multiple-character mappings (such as the rule that 'ö' = 'OE' in one of the two German collations).
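The same idea exists in plain Java, which may make the upcoming examples easier to digest: java.text.Collator compares strings under a configurable strength, where PRIMARY roughly corresponds to case- and accent-insensitive and TERTIARY to case- and accent-sensitive. This is an analogy only, not how SQL Server implements collations.

import java.text.Collator;
import java.util.Locale;

public class CollatorDemo {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);

        // PRIMARY: only base letters matter (roughly CI_AI)
        collator.setStrength(Collator.PRIMARY);
        System.out.println(collator.equals("Aries", "aries"));      // true
        System.out.println(collator.equals("aries", "\u00E1ries")); // true ('áries')

        // TERTIARY: case and accents both matter (roughly CS_AS)
        collator.setStrength(Collator.TERTIARY);
        System.out.println(collator.equals("Aries", "aries"));      // false
        System.out.println(collator.equals("aries", "\u00E1ries")); // false
    }
}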

A collation can be accent-sensitive (AS) or accent-insensitive (AI), and case-sensitive (CS) or case-insensitive (CI).

The following should give you a pretty good idea about this.


create table xyz..csas (name varchar(10) COLLATE Latin1_General_CS_AS)
insert into xyz..csas values ('Aries')
insert into xyz..csas values ('aries')
insert into xyz..csas values (N'áries')
select * from xyz..csas where name = 'Aries' --Aries
select * from xyz..csas where name = 'aries' --aries
select * from xyz..csas where name = N'áries' --áries

create table xyz..csai (name varchar(10) COLLATE Latin1_General_CS_AI)
insert into xyz..csai values ('Aries')
insert into xyz..csai values ('aries')
insert into xyz..csai values (N'áries')
select * from xyz..csai where name = 'Aries'  --Aries
select * from xyz..csai where name = 'aries' --aries, áries
select * from xyz..csai where name = N'áries' --aries, áries

create table xyz..ciai (name varchar(10) COLLATE Latin1_General_CI_AI)
insert into xyz..ciai values ('Aries')
insert into xyz..ciai values ('aries')
insert into xyz..ciai values (N'áries')
select * from xyz..ciai where name = 'Aries' --Aries,aries,áries
select * from xyz..ciai where name = 'aries' --Aries,aries,áries
select * from xyz..ciai where name = N'áries' --Aries,aries,áries

create table xyz..cias (name varchar(10) COLLATE Latin1_General_CI_AS)
insert into xyz..cias values ('Aries')
insert into xyz..cias values ('aries')
insert into xyz..cias values (N'áries')
select * from xyz..cias where name = 'Aries' --Aries,aries
select * from xyz..cias where name = 'aries' --Aries,aries
select * from xyz..cias where name = N'áries' --áries


Let me walk through the steps by which I arrived at the checklist.


I wrote test cases in our application that assert the queries above, with and without sendStringParametersAsUnicode set.

I expected them to succeed for dataSourceWithUniCodeEnabled and fail for dataSourceWithUniCodeDisabled. But no, they succeeded for both.

So when I passed 'á' from the code with Unicode disabled, how did the string still match, as if it had been sent as Unicode?

This is where the database-level collation matters. It was working fine because the collation at the database level was also Latin1_General_CS_AS.

Confusing, right? Let us take a step back.

I create a table with Latin1_General_CS_AS collation and a varchar column. By definition, I should not be able to insert Unicode. But why does the following work?

create table xyz..egvarchar (name varchar(10) COLLATE Latin1_General_CS_AS)
insert into  xyz..egvarchar  values (N'áries')
select * from xyz..egvarchar where name = N'áries' -- gives áries as output
This is because the collation determines which characters are representable even in a varchar column (each collation maps the 256 single-byte values to a code page). To see which characters a collation supports:

;WITH AllNumbers AS
(
    SELECT 1 AS Number
    UNION ALL
    SELECT Number + 1
    FROM AllNumbers
    WHERE Number + 1 < 256
)
SELECT Number AS ASCII_Value, CHAR(Number) COLLATE Latin1_General_CS_AI AS ASCII_Char
FROM AllNumbers
OPTION (MAXRECURSION 256);


So if I try to insert a non-Latin character into the same table
insert into xyz..egvarchar values (N'áяies')
it gets inserted, but when I select I get:
áries
á?ies --> note the question mark: 'я' has no mapping in this collation's code page, so the varchar column cannot store it.
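You can reproduce this substitution in plain Java: windows-1252 is the code page behind the Latin1_General collations, and encoding a string containing an unmappable character yields the same '?'. A small sketch:

import java.nio.charset.Charset;

public class CodePageDemo {
    public static void main(String[] args) {
        // windows-1252 is the code page used by the Latin1_General collations
        Charset cp1252 = Charset.forName("windows-1252");
        String value = "\u00E1\u044Fies"; // áяies: 'я' is Cyrillic, outside this code page
        byte[] bytes = value.getBytes(cp1252); // the unmappable 'я' is replaced with '?'
        System.out.println(new String(bytes, cp1252)); // prints á?ies
    }
}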
If I repeat the same exercise with the column as nvarchar, the output would be:
create table xyz..egnvarchar (name nvarchar(10) COLLATE Latin1_General_CS_AS)
insert into  xyz..egnvarchar  values (N'áries')
insert into  xyz..egnvarchar  values (N'áяies')
select * from xyz..egnvarchar
OUTPUT:
áries
áяies
This is why it is important to make sure there are no nvarchar columns before turning Unicode off. Now back to our problem: the reason my tests did not fail with sendStringParametersAsUnicode set to false was that 'á' was supported by the collation. Which collation? The collation at the level at which you have defined your datasource.
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/xyz" />
In the above example, it is the collation of the database xyz. This is very important; hence checklist item 3: the collation of the columns in your database should be the same as the database-level collation. To validate this again, we created another database, abc, with collation Cyrillic_General_CI_AS, and created two tables in it: one with Cyrillic_General_CI_AS collation and another with Latin1_General_CI_AS.
create table abc..cyrillic (name varchar(10) COLLATE Cyrillic_General_CI_AS)
insert into abc..cyrillic values (N'Aяies')
select * from abc..cyrillic where name = N'Aяies'
create table abc..latin (name varchar(10) COLLATE Latin1_General_CI_AS)
insert into abc..latin values (N'áries')
select * from abc..latin where name = N'áries'
The table with collation set to Cyrillic returns results irrespective of the sendStringParametersAsUnicode property. The table with collation set to Latin returns results only when sendStringParametersAsUnicode is set to true. In other words, whether the SQL driver treats a parameter as Unicode or not effectively depends on the collation of the database your datasource points to. Note that the level at which the url property is set is very important: it should be set at the database level, since you ran the pre-check against your database.
For example:
jdbc:jtds:sqlserver://dataserver_name/xyz -> right
jdbc:jtds:sqlserver://dataserver_name -> wrong (the collation of the dataserver and the database might differ)
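A cheap safeguard against the wrong-URL mistake is to assert at startup which database the connection actually lands on. A sketch (the class and method names are mine):

import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class DataSourceSanityCheck {

    /** Fails fast if the datasource URL points at the server instead of the intended database. */
    public static void assertDatabase(DataSource dataSource, String expectedDatabase) {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
        String actual = jdbcTemplate.queryForObject("SELECT DB_NAME()", String.class);
        if (!expectedDatabase.equalsIgnoreCase(actual)) {
            throw new IllegalStateException(
                    "expected database " + expectedDatabase + " but connected to " + actual);
        }
    }
}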

 Code

<bean id="cyrillicDataSourceWithUniCodeDisabled" class="org.apache.tomcat.jdbc.pool.DataSource">
<property name="driverClassName" value="net.sourceforge.jtds.jdbc.Driver" />
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/abc" />
<property name="initialSize" value="2" />
<property name="maxActive"
value="<some_value>" />
<property name="maxWait"
value="<some_value>" />
<property name="removeAbandonedTimeout"
value="<some_value>" />
        <property name="minEvictableIdleTimeMillis" value="<some_value>"/>
        <property name="timeBetweenEvictionRunsMillis" value="<some_value>"/>
<property name="testOnBorrow" value="true" />
<property name="validationQuery">
<value>SELECT 1</value>
</property>
<property name="connectionProperties" value="sendStringParametersAsUnicode=false;"/>
</bean>
<bean id="cyrillicDataSourceWithUniCodeEnabled" class="org.apache.tomcat.jdbc.pool.DataSource">
<property name="driverClassName" value="net.sourceforge.jtds.jdbc.Driver" />
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/abc" />
<property name="initialSize" value="2" />
<property name="maxActive"
value="<some_value>" />
<property name="maxWait"
value="<some_value>" />
<property name="removeAbandonedTimeout"
value="<some_value>" />
        <property name="minEvictableIdleTimeMillis" value="<some_value>"/>
        <property name="timeBetweenEvictionRunsMillis" value="<some_value>"/>
<property name="testOnBorrow" value="true" />
<property name="validationQuery">
<value>SELECT 1</value>
</property
>
</bean>


If you change the url of the injected datasource from
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/abc" />
    to
<property name="url" value="jdbc:jtds:sqlserver://dataserver_name/" />
the dataserver's collation will be used instead, which is Latin; only the test with Unicode enabled will succeed here.



import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;

import java.util.List;

import javax.sql.DataSource;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowMapper;

public class CyrillicDatasourceUnicodeTest extends BaseDAOTest {

    private static final String CYRILLIC_ACCENT_CASE = "A" + "\u044F" + "ies"; // Aяies

    private static final String LATIN_ACCENT_CASE = "\u00E1" + "ries"; // áries

    @Autowired
    private DataSource cyrillicDataSourceWithUniCodeEnabled;

    @Autowired
    private DataSource cyrillicDataSourceWithUniCodeDisabled;

    // maps the single varchar column of each test table to a String
    private final RowMapper<String> rowMapper = (rs, rowNum) -> rs.getString(1);

    @org.junit.Test
    public void testForCyrillicForUnicodeEnabled() {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(cyrillicDataSourceWithUniCodeEnabled);
        String sql = "select name from abc..cyrillic where name = ?";

        List<String> results = jdbcTemplate.query(sql, rowMapper, CYRILLIC_ACCENT_CASE);
        assertNotNull(results);
        System.out.println(results);
        assertEquals(1, results.size());
        assertEquals(CYRILLIC_ACCENT_CASE, results.get(0));
    }

    @org.junit.Test
    public void testForCyrillicForUnicodeDisabled() {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(cyrillicDataSourceWithUniCodeDisabled);
        String sql = "select name from abc..cyrillic where name = ?";

        List<String> results = jdbcTemplate.query(sql, rowMapper, CYRILLIC_ACCENT_CASE);
        assertNotNull(results);
        System.out.println(results);
        assertEquals("supposed to match regardless of the sendStringParametersAsUnicode "
                        + "property because the database has the same collation as the column",
                1, results.size());
        assertEquals(CYRILLIC_ACCENT_CASE, results.get(0));
    }

    @org.junit.Test
    public void testForLatinForUnicodeEnabled() {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(cyrillicDataSourceWithUniCodeEnabled);
        String sql = "select name from abc..latin where name = ?";

        List<String> results = jdbcTemplate.query(sql, rowMapper, LATIN_ACCENT_CASE);
        assertNotNull(results);
        System.out.println(results);
        assertEquals(1, results.size());
        assertEquals(LATIN_ACCENT_CASE, results.get(0));
    }

    @org.junit.Test
    public void testForLatinForUnicodeDisabled() {
        JdbcTemplate jdbcTemplate = new JdbcTemplate(cyrillicDataSourceWithUniCodeDisabled);
        String sql = "select name from abc..latin where name = ?";

        List<String> results = jdbcTemplate.query(sql, rowMapper, LATIN_ACCENT_CASE);
        assertNotNull(results);
        System.out.println(results);
        assertEquals("no results expected because Unicode is disabled and the collation of "
                        + "the column (Latin) differs from that of the database (Cyrillic)",
                0, results.size());
    }

}

Thursday, July 13, 2017

ReflectionTestUtils to tweak private fields for unit tests

I was writing an integration test for a create-in-bulk mode, which inserts data in bulk if the number of records is over 100.

Now, just to test this bulk insert feature, I do not want to load 100 records. So I need to tweak the batchSize from 100 down to 1.

// The DAO under test holds the threshold as a private field:
public class DaoImpl implements Dao {
    private static Integer batchSize = 100;
    // ...
}

// The test tweaks that field via reflection:
public class DaoImplTest {

    @Autowired
    private Dao dao;

    @org.junit.Test
    public void testBulkInsert() {
        // get the default size
        Integer defaultBatchSize = (Integer) ReflectionTestUtils.getField(dao, "batchSize");

        // tweak the batch size so a single record triggers the bulk path
        Integer batchSize = 1;
        ReflectionTestUtils.setField(dao, "batchSize", batchSize, Integer.class);
        assertEquals(batchSize, ReflectionTestUtils.getField(dao, "batchSize"));

        // proceed with test

        // revert it back to the default size
        ReflectionTestUtils.setField(dao, "batchSize", defaultBatchSize, Integer.class);
    }
}

ReflectionTestUtils.setField will internally take care of making your field accessible.
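Under the hood this is ordinary reflection. A simplified sketch of the equivalent (not the actual Spring source, which also walks up the class hierarchy):

import java.lang.reflect.Field;

public final class FieldTweaker {

    /** Simplified equivalent of ReflectionTestUtils.setField for a field declared on the target's class. */
    static void setField(Object target, String name, Object value) throws Exception {
        Field field = target.getClass().getDeclaredField(name);
        field.setAccessible(true); // this is the step ReflectionTestUtils handles for you
        field.set(target, value);
    }
}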


Use of TransactionProxyFactoryBean in Spring

Generally, while writing integration tests, we annotate with @TransactionConfiguration and set defaultRollback to true. The rollback helps when you do not want to explicitly delete the test data after every test case.

In a project I was working on, I happened to write an integration test for bulk insert. Bulk inserts into the table are done by dumping the records into a stage table. There is a commons library in the firm that is used to create this stage table.

I observed that the rollback was not happening, but only for the bulk insert test cases. At first I thought: we are creating a stage table, which means a CREATE statement is executed, and since DDL statements are implicitly committed we will not be able to roll back. But later I found that in SQL Server, DDL statements are not implicitly committed; they can participate in a transaction.

So when I actually went through the stageUtils source code, I found the cause was a connection.commit() it was invoking; it was not the DDL statement that was blocking the rollback.
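To make the failure mode concrete, here is a hypothetical sketch of the kind of code inside StageUtils (we do not own the real source; the class and method names here are invented):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class StageUtilsSketch {

    // Hypothetical: the explicit commit, not the DDL, is what defeats the test's rollback.
    public void createStageTable(Connection connection, String createTableDdl) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            statement.execute(createTableDdl);
        }
        connection.commit(); // commits everything done so far in the surrounding transaction
    }
}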

How to fix it?

1) stageUtils is not something the project owns, hence there is no way I can modify it.
2) For my transaction to roll back, I need to treat this stage table creation as a new transaction. The way to achieve that is by annotating stageUtils with @Transactional(propagation = Propagation.REQUIRES_NEW).

But note that I cannot do 2) because of 1).
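For completeness, if we did own the code, 2) would look like the following (hypothetical, since StageUtils cannot be modified):

import org.springframework.transaction.annotation.Propagation;
import org.springframework.transaction.annotation.Transactional;

public class StageUtils {

    // Hypothetical annotation we cannot actually add to the shared library
    @Transactional(propagation = Propagation.REQUIRES_NEW)
    public void createStageTable(String tableName) {
        // the stage table is created and committed in its own transaction,
        // leaving the caller's transaction free to roll back
    }
}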

Here is where TransactionProxyFactoryBean comes to the rescue. It enables creating a proxy around stageUtils and defining the required transaction attributes.

<bean id="stageUtils" class="com.foo.infra.util.StageUtils"
    scope="prototype"/>
<bean id="stageUtilsProxy"
    class="org.springframework.transaction.interceptor.TransactionProxyFactoryBean">
    <property name="transactionManager" ref="someTransactionManager" />
    <property name="target" ref="stageUtils" />
    <property name="transactionAttributes">
        <props>
            <prop key="*">PROPAGATION_REQUIRES_NEW</prop>
        </props>
    </property>
</bean>
<bean id="myDAO"
    class="com.foo.myproject.dao.someDAOImpl">
        <!-- BEFORE -->
        <!-- <property name="stageUtils" ref="stageUtils" /> -->
        <!-- AFTER -->
        <property name="stageUtils" ref="stageUtilsProxy" />
</bean>


@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration
@TransactionConfiguration(transactionManager = "someTransactionManager", defaultRollback = true)
@Transactional
public abstract class BaseIntegrationDAOTest

PROPAGATION_REQUIRES_NEW forces stageUtils to start a transaction of its own, so its commit no longer disturbs your test's rollback.