Isaac Hands

Forum Replies Created

Viewing 15 posts - 1 through 15 (of 17 total)
    in reply to: Sample XML Files #10693
    Isaac Hands
    Moderator

    I did some investigating. Anyone who wants a synthetic v18 XML dataset of about 500,000 records for evaluation or training can contact Recinda Sherman at NAACCR (rsherman at naaccr.org). Include the following in your email:

    1. An explanation of what the data will be used for and whether you want record type I or C. The data cannot be shared outside of your stated use.

    2. A completed “Data Confidentiality Agreement for Researchers” document from this URL: https://www.naaccr.org/irb-information-for-cina/#IRBFORMS

    This synthetic dataset is not just junk data. Recinda can explain its characteristics best, but the values are meaningful with respect to the distribution of values in actual cancer datasets and have been appropriately anonymized.

    in reply to: question about patient/turmor model #10648
    Isaac Hands
    Moderator

    Thanks for the feedback. Is this something we could address by changing the way the macro works, for example by having it operate on a directory of files instead of an individual file? Or are you saying that any macro-based XML conversion would be too difficult to use in SAS?

    in reply to: question about patient/turmor model #10645
    Isaac Hands
    Moderator

    I am not sure how to do this in the SEER data viewer, but if you want to maintain your current SAS-based workflow, you can flatten the XML file, either as a pre-processing step or automatically using the SAS XML macro, which is the first option listed here:
    https://github.com/imsweb/naaccr-xml/wiki/7:-NAACCR-XML-and-SAS
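
    For anyone curious what "flattening" means as a pre-processing step, here is a minimal Python sketch. It is not the SAS macro and not the IMS library. It assumes the usual NaaccrData / Patient / Tumor / Item nesting, ignores the XML namespace and root attributes that a real file carries, and uses made-up sample values. The function names `items` and `flatten` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample; a real NAACCR XML file also has a namespace and root attributes.
SAMPLE_XML = """<NaaccrData>
  <Item naaccrId="registryId">0000001</Item>
  <Patient>
    <Item naaccrId="patientIdNumber">00000001</Item>
    <Tumor>
      <Item naaccrId="primarySite">C509</Item>
    </Tumor>
    <Tumor>
      <Item naaccrId="primarySite">C341</Item>
    </Tumor>
  </Patient>
</NaaccrData>"""


def items(element):
    """Return the direct <Item> children of an element as {naaccrId: value}."""
    return {item.get("naaccrId"): (item.text or "") for item in element.findall("Item")}


def flatten(xml_text):
    """Return one flat dict per <Tumor>, repeating the root and patient items."""
    root = ET.fromstring(xml_text)
    root_items = items(root)
    rows = []
    for patient in root.findall("Patient"):
        patient_items = items(patient)
        for tumor in patient.findall("Tumor"):
            rows.append({**root_items, **patient_items, **items(tumor)})
    return rows


rows = flatten(SAMPLE_XML)
for row in rows:
    print(row)
```

    Each tumor becomes one row, with the patient-level and file-level items copied onto it, which is the same shape as a fixed-width record.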

    in reply to: Sample XML Files #10644
    Isaac Hands
    Moderator

    Kathleen, I assume you are talking about the sample files posted here:
    https://github.com/imsweb/naaccr-xml/wiki/2:-Sample-Data-Files

    Those are definitely “rudimentary” and we will update them to v18, but that may not be what you are looking for. If you want meaningful v18 XML data right now, you can convert an existing fixed-width v18 file into an XML file very easily using the NAACCR-XML Utility tool:
    https://github.com/imsweb/naaccr-xml/wiki/1:-NAACCR-XML-Utility-Tool

    Also, there is a NAACCR group that works on creating synthetic data and they created some nice synthetic XML data sets for the NAACCR Hackathon last year. I know that those particular datasets were bound by data-use agreements since they were based on real submission files – so we will explore having those released to a wider audience.

    in reply to: Editor Tool for XML files #10087
    Isaac Hands
    Moderator

    Bruce, the best general-purpose XML editor I have found is Oxygen XML Editor (https://www.oxygenxml.com), which ended up costing us about $100 for the initial purchase and $22/year for maintenance/updates. It will open .gz files natively and has a nice feature that will transform/change XML documents according to the XQuery standard (if you are willing to learn XQuery transformations): https://www.w3schools.com/xml/xquery_intro.asp
    Just to be clear, I do not use an XML editor for NAACCR XML since we parse all of our incoming XML files into a relational database before making changes.
    Also, I think you are following the other forum threads, but the NAACCR XML SAS macro seems to be working for people who need to handle XML natively in SAS: https://github.com/imsweb/naaccr-xml/wiki/7:-NAACCR-XML-and-SAS

    in reply to: Using SAS with NAACCR XML #9759
    Isaac Hands
    Moderator

    Bruce, thank you for this request and your description of the problems you may face with XML. In the NAACCR XML Workgroup we have been discussing the utility of a delimited file format for certain use cases, specifically related to compatibility with SAS and other statistical software. As these discussions are ongoing, we are all forming and re-forming our opinions on the matter, so I am not sure I can present a coherent picture of where things currently stand until we have discussed it further. (You are welcome to join our Workgroup at any time.)

    On the subject of choosing a delimiter, we can define escaping rules for whatever delimiter we choose, so I am not worried about trying to guess whether a certain character will show up in the output or not.
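
    To illustrate why the choice of delimiter matters less once escaping rules exist, here is a small Python sketch using the standard `csv` module. The pipe delimiter, the double-quote rule, and the sample values are assumptions for the example only, not anything the Workgroup has decided.

```python
import csv
import io

# Made-up record whose second value contains the delimiter and whose third contains quotes.
record = ["00000001", "text with | pipe", 'text with "quotes"']

# Write with a pipe delimiter; values are quoted only when they need to be.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|", quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(record)
encoded = buffer.getvalue()

# Read the line back with the same rules and recover the original values.
decoded = next(csv.reader(io.StringIO(encoded), delimiter="|", quotechar='"'))

print(encoded.strip())
print(decoded == record)
```

    Because the writer quotes any value that contains the delimiter and doubles embedded quote characters, the reader gets back exactly what was written, whatever characters show up in the data.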

    On the subject of header names, I am not a fan of defining another name for data items. We are currently working with UDS to harmonize the NAACCR “Short Name” list with the current XML naaccrIds so that we can reduce duplicate naming efforts in the NAACCR Community. This will probably involve shortening the XML naaccrIds and agreeing on a standard way to generate stable names across versions.

    in reply to: Using SAS with NAACCR XML #7025
    Isaac Hands
    Moderator

    For SAS 9.4, this is the magic parameter you need to set in the JREOPTIONS of C:\Program Files\SASHome\SASFoundation\9.4\nls\en\sasv9.cfg:
    -Dsas.jre.libjvm=C:\Program Files\Java\jdk1.8.0_161\jre\bin\server\jvm.dll

    The instructions online about setting sas.jre.home are wrong. Here is the complete JREOPTIONS setting from my sasv9.cfg file:

    /*  Options used when SAS is accessing a JVM for JNI processing  */
    -JREOPTIONS=(
    
            -DPFS_TEMPLATE=!SASROOT\tkjava\sasmisc\qrpfstpt.xml
            -Djava.class.path=C:\PROGRA~1\SASHome\SASVER~1\eclipse\plugins\SASLAU~1.JAR
            -Djava.security.auth.login.config=!SASROOT\tkjava\sasmisc\sas.login.config
            -Djava.security.policy=!SASROOT\tkjava\sasmisc\sas.policy
            -Djava.system.class.loader=com.sas.app.AppClassLoader
            -Dlog4j.configuration=file:/C:/Program%20Files/SASHome/SASFoundation/9.4/tkjava/sasmisc/sas.log4j.properties
            -Dsas.app.class.path=C:\PROGRA~1\SASHome\SASVER~1\eclipse\plugins\tkjava.jar
            -Dsas.ext.config=!SASROOT\tkjava\sasmisc\sas.java.ext.config
            -Dtkj.app.launch.config=C:\PROGRA~1\SASHome\SASVER~1\picklist
            -Dsas.jre.libjvm=C:\Program Files\Java\jdk1.8.0_161\jre\bin\server\jvm.dll
            -Xms128m
            -Xmx128m
            )

    You probably know this, but don’t forget to set your environment variable CLASSPATH to point to your jar file.

    in reply to: Editor Tool for XML files #7012
    Isaac Hands
    Moderator

    OK, so the rapid reports are NAACCR records with a lot of empty/missing data? What record type do they use?

    And I assume the pathology/DC/clinic records are not in a well-defined format, but you use SAS to normalize them into a NAACCR record?

    in reply to: Using SAS with NAACCR XML #6989
    Isaac Hands
    Moderator

    I just used the existing Java library to read each patient as it occurred in the XML file, writing out an incremental “csvPatientId” number for each <Patient> element so that SAS would know where the unique patients were. If there is interest in this technique, I will post my Java code.
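
    Until the Java code is posted, here is a rough Python sketch of the same idea, not the actual implementation: stream the file one <Patient> at a time and write an incrementing "csvPatientId" on every row that comes from that patient. The function name `xml_to_csv`, the column list, and the sample data are hypothetical, and the sketch ignores the XML namespace that a real NAACCR XML file uses.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample with two patients and three tumors.
SAMPLE_XML = b"""<NaaccrData>
  <Patient>
    <Item naaccrId="patientIdNumber">00000001</Item>
    <Tumor><Item naaccrId="primarySite">C509</Item></Tumor>
    <Tumor><Item naaccrId="primarySite">C341</Item></Tumor>
  </Patient>
  <Patient>
    <Item naaccrId="patientIdNumber">00000002</Item>
    <Tumor><Item naaccrId="primarySite">C619</Item></Tumor>
  </Patient>
</NaaccrData>"""


def item_values(element):
    """Return the direct <Item> children of an element as {naaccrId: value}."""
    return {i.get("naaccrId"): (i.text or "") for i in element.findall("Item")}


def xml_to_csv(xml_stream, csv_stream, columns):
    """Write one CSV row per tumor; return the number of patients read."""
    writer = csv.DictWriter(
        csv_stream, fieldnames=["csvPatientId"] + columns, extrasaction="ignore"
    )
    writer.writeheader()
    patient_number = 0
    # iterparse yields each element when its closing tag is read, so the whole
    # file never has to be parsed into memory before rows are written.
    for _event, element in ET.iterparse(xml_stream, events=("end",)):
        if element.tag != "Patient":
            continue
        patient_number += 1
        patient_items = item_values(element)
        for tumor in element.findall("Tumor"):
            row = {"csvPatientId": patient_number, **patient_items, **item_values(tumor)}
            writer.writerow(row)
        element.clear()  # drop this patient's children once its rows are written
    return patient_number


output = io.StringIO()
patients_seen = xml_to_csv(io.BytesIO(SAMPLE_XML), output, ["patientIdNumber", "primarySite"])
csv_lines = output.getvalue().splitlines()
print("\n".join(csv_lines))
```

    Rows that share a csvPatientId came from the same <Patient> element, so the patient grouping can be recovered from the flat file.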

    As a fun exercise, I started this experiment by creating an Access Database instead of a CSV file from the XML, mostly because SAS has “native” support for Access databases, much better than SAS XML support, and Microsoft Access has been mentioned several times as a tool that some registries use. Unfortunately, I quickly ran into limitations of the Access database format, specifically the 2GB file size and the number of fields in a table:
    https://support.office.com/en-us/article/Access-specifications-0cf3c66f-9cf2-4e32-9568-98c1025bb47c
    From what I can tell, Access can load CSV files with some fiddling, so maybe this solution would help both Access and SAS users.

    in reply to: Using SAS with NAACCR XML #6985
    Isaac Hands
    Moderator

    Following up on my last post to this thread…
    I wonder if using a CSV formatted file would be a good intermediary between SAS and XML? The CSV format doesn’t suffer from many of the same limitations as the fixed-width file, such as needing to know the position and length of all variables beforehand, so translating between XML and CSV will not require maintenance of Volume II metadata to go along with every NAACCR Item. CSV will still be limited for conveying multi-tier data, such as Patient/Tumor/etc., but SAS does not understand multi-tier data models anyway, so maybe that’s OK for this use case.
    I have been playing around with some Java code running inside SAS that can generate CSV from NAACCR XML and then load the data as a SAS dataset. So far it looks promising: it takes about 4.5 minutes to load a 6GB NAACCR XML file into a SAS dataset with this method on a pretty basic Windows 10 desktop computer. I am not sure whether that will be acceptable, but it might make a nice proof of concept.
    Here is what the SAS code looks like:

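    /* Input NAACCR XML file and the CSV file to generate */
    filename xmlfile 'C:\\Users\\isaac\\Documents\\ky9515v16.xml';
    filename csvfile 'C:\\Users\\isaac\\Documents\\ky9515v16.csv';
    
    /* Call the Java converter class from a DATA step through the JavaObj component */
    data _null_;
    	do;
    		declare JavaObj j1 ("edu/uky/kcr/naaccrxml/csv/ConvertXmlToCsv", xmlfile, csvfile); 
    		j1.callVoidMethod ("convert");
    		j1.delete();
    	end;
    run;
    
    /* Load the generated CSV into a SAS dataset, taking variable names from the header row */
    proc import datafile=csvfile
    	out=fromcsv
    	dbms=csv;
    	getnames=yes;
    run;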

    The Java code behind this uses the Java NAACCR XML library from IMS.

    in reply to: Using SAS with NAACCR XML #6955
    Isaac Hands
    Moderator

    I am not ready to give up on SAS yet.
    It sounds like your suggestions above are along the same lines as what I have been wondering: if SAS doesn’t really support large XML data files, is there an intermediate format that SAS could use instead of reading the XML directly?
    For example, here is a list of all “DBMS” formats that SAS can use natively in the PROC IMPORT procedure:
    https://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003094743.htm

    Is it possible/probable/straightforward to create some sort of library that SAS can call directly (Java, Python, etc.) that will take an XML file, create one of these intermediate formats, and then load into SAS datasets so that the rest of SAS is happy?

    in reply to: Editor Tool for XML files #6954
    Isaac Hands
    Moderator

    Bruce, thank you for the specific requirements you outlined above. I am curious what database you are using at your registry? In b) and c) of your original post you wrote about the need to load XML data into a registry database, can you tell me what database technology you are using?

    in reply to: Using SAS with NAACCR XML #6719
    Isaac Hands
    Moderator

    Bruce,
    Thank you for trying out the XML Mapper in SAS. We will discuss the issues you bring up in our XML Workgroup calls and post on this thread when we have some ideas about how to move forward.
    The tasks you are describing are straightforward in a language that has first-class support for XML, such as Python, Java, C#, etc., but we are still trying to figure out the best way to deal with large XML files in SAS that need hundreds of variables. Writing out XML files seems to be an afterthought in SAS, so we are still trying to get to the bottom of that issue as well. As you probably know, SAS is a very expensive piece of software, and many of us on the XML Workgroup are not as familiar with it as we need to be. As a result, it is taking longer to find solutions to these problems than it did with our currently published NAACCR XML software tools and libraries, but we are working on it.
    -Isaac

    in reply to: Using SAS with NAACCR XML #6677
    Isaac Hands
    Moderator

    Valerie,
    Please post your results when you get a chance to try out the XML code in SAS, I know others in the NAACCR Community will be interested in your experience.
    You mentioned using a different tool to convert SAS output to NAACCR XML, possibly Python, and I noticed that there is an “official” way to access SAS datasets from Python with this library: https://github.com/sassoftware/saspy
    If someone created a Python script to create NAACCR XML from a SAS dataset, would that be something you would be interested in?
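
    As a starting point for discussion, here is a minimal Python sketch of what such a script might look like once the SAS rows are available in Python. How the rows are read from SAS (for example with saspy) is left out. The split between patient-level and tumor-level fields, the function names, and the sample rows are all assumptions, and the sketch omits the namespace and root attributes that a valid NAACCR XML file requires.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical split: these naaccrIds go on <Patient>, everything else on <Tumor>.
PATIENT_FIELDS = {"patientIdNumber"}

# Made-up flat rows, one per tumor, already sorted by patient.
rows = [
    {"patientIdNumber": "00000001", "primarySite": "C509"},
    {"patientIdNumber": "00000001", "primarySite": "C341"},
    {"patientIdNumber": "00000002", "primarySite": "C619"},
]


def add_items(parent, values):
    """Append one <Item naaccrId="..."> child per key/value pair."""
    for naaccr_id, value in values.items():
        item = ET.SubElement(parent, "Item", naaccrId=naaccr_id)
        item.text = value


def rows_to_xml(flat_rows):
    """Group consecutive rows by patient and nest their tumors under <Patient>."""
    root = ET.Element("NaaccrData")
    # groupby only merges adjacent rows, so the input must be sorted by patient.
    for _patient_id, group in groupby(flat_rows, key=lambda r: r["patientIdNumber"]):
        tumor_rows = list(group)
        patient = ET.SubElement(root, "Patient")
        add_items(patient, {k: v for k, v in tumor_rows[0].items() if k in PATIENT_FIELDS})
        for row in tumor_rows:
            tumor = ET.SubElement(patient, "Tumor")
            add_items(tumor, {k: v for k, v in row.items() if k not in PATIENT_FIELDS})
    return root


root = rows_to_xml(rows)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

    The real work in a production script would be deciding which items belong at the patient level versus the tumor level, which this sketch hard-codes.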

    in reply to: Using SAS with NAACCR XML #6649
    Isaac Hands
    Moderator

    This sounds like a great way to do it. Thank you for posting the files and explanation. I wonder what the best way is to get other NAACCR Community SAS users to try it out and give feedback.

Copyright © 2018 NAACCR, Inc. All Rights Reserved