Need to build an interface for XML to SQL Datadata

Viewing 15 posts - 16 through 30 (of 38 total)
  • #7469
    Fabian Depry
    Moderator

    Hi Jeff, you can use the SEER Data Generator to create large NAACCR XML files. Only core data items will have a value (about 40 or 50 of them), so it’s not perfect (for example, it would be nice to have text for Abstracts). But it’s better than nothing 🙂

    You can download it from this release page: https://github.com/imsweb/data-generator/releases

    Look for the “data-generator-X.X-all.jar” where “X.X” is the latest release.

    This is a standalone JAR, so you should be able to just download it and double-click it (assuming you have a Java environment installed on your machine).

    You can find information about the data generator itself and what variables it computes on the project’s home page: https://github.com/imsweb/data-generator

    #7470
    Jeff Reed
    Spectator

    Brought a tear to my eye to be able to generate 5,000-record sample files for V16 and V18, both fixed-width and XML, that include my test facility_id and home state of IL. Now for some real testing …

    #7471
    Fabian Depry
    Moderator

    As long as those are tears of joy! 🙂

    Good luck with your testing.

    #7472
    Rich Pinder
    Moderator

    Hi all.
    Catching up a bit on the last few posts. Jeff, you’ve been busy!

    Having realistic, cross-variable logical data to process is important for useful testing, for sure, and it is something another NAACCR WG has been working on for a while. Fabian/Isaac used an initial dataset from this group in the June HackAThon – work does continue in that group. I’ll check to see the status and get back to all.

    You mention that your next steps are to test the I record? Looking ahead to what a DB loader tool for XML will need, I think some will want the ability to include all record types. Will the tools be able to handle A records?

    Also, in your C# program, are you unpacking the XML to a SINGLE file (using the KB delim!) … or into SEPARATE files, based on the relations? I was envisioning that this temporary/intermediary approach (xml->flat->sqlinsert) might grow more complex, as a future XML spec from NAACCR might (will?) move past consolidated data and into fully nested processing data structures (pat->tum->admission, pat->genetic tests, etc). If that move does happen, would a single file still allow for efficient bulk loads?

    aok… this is really encouraging to see, and a great help to the community!
    take care
    r

      Identify/create better testing data. The Current XML sample file is a NAACCR version 140 ‘A’ record type. I need to test 160 and 180 versions for record type I.

      #7475
      Jeff Reed
      Spectator

      Thank you Rich,

      The data generation tool Fabian pointed me to in the GitHub community looks like it did a good job of generating test files (I haven’t used the files I generated yet). The tool populates 40-50 field variables, so it is a good source for testing basic load functionality, but it falls short for timing a load test with fully populated records.

      Though my needs are only for processing ‘I’ records, the tool will be able to handle all record types. I am currently working on incorporating a user customized XML reference file to define the fields to extract into the load file.

      I am handling the data as one flat file with each row representing a case/tumor. Patient/header data will be repeated on each row. Much of the validation and breakdown of the data will be done in the DB. I still see the database as necessary for checking for duplicates and generating aggregate totals.
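
      A minimal Python sketch of the one-row-per-tumor layout described above (Jeff's actual tool is written in C#, so this only illustrates the idea). It assumes the usual NAACCR XML nesting of NaaccrData > Patient > Tumor with Item children in the http://naaccr.org/naaccrxml namespace; the sample data, the column list, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace used by NAACCR XML data files (assumed here; check your own files).
NS = {"n": "http://naaccr.org/naaccrxml"}

# Hand-written sample, for illustration only.
SAMPLE = """<NaaccrData xmlns="http://naaccr.org/naaccrxml" recordType="I">
  <Item naaccrId="registryId">0000001234</Item>
  <Patient>
    <Item naaccrId="patientIdNumber">00000001</Item>
    <Tumor>
      <Item naaccrId="primarySite">C509</Item>
      <Item naaccrId="dateOfDiagnosis">20180115</Item>
    </Tumor>
    <Tumor>
      <Item naaccrId="primarySite">C341</Item>
      <Item naaccrId="dateOfDiagnosis">20170302</Item>
    </Tumor>
  </Patient>
  <Patient>
    <Item naaccrId="patientIdNumber">00000002</Item>
    <Tumor>
      <Item naaccrId="primarySite">C619</Item>
      <Item naaccrId="dateOfDiagnosis">20160520</Item>
    </Tumor>
  </Patient>
</NaaccrData>"""


def direct_items(element):
    """Return {naaccrId: value} for the Item children directly under element."""
    return {i.get("naaccrId"): (i.text or "") for i in element.findall("n:Item", NS)}


def flatten(xml_text, columns):
    """Build one row per tumor, repeating header and patient values on each row.

    Patients without any Tumor element produce no rows in this sketch.
    """
    root = ET.fromstring(xml_text)
    header = direct_items(root)
    rows = []
    for patient in root.findall("n:Patient", NS):
        patient_values = direct_items(patient)
        for tumor in patient.findall("n:Tumor", NS):
            merged = {**header, **patient_values, **direct_items(tumor)}
            rows.append([merged.get(col, "") for col in columns])
    return rows


def to_delimited(rows, delimiter="|"):
    """Serialize rows as delimited text suitable for a bulk loader."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return buf.getvalue()


COLUMNS = ["registryId", "patientIdNumber", "primarySite", "dateOfDiagnosis"]
ROWS = flatten(SAMPLE, COLUMNS)
FLAT_TEXT = to_delimited(ROWS)
```

      The column list plays the role of the user-customized reference file mentioned above: only the items it names end up in the load file.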

      Building multiple files based on a relational data model is not envisioned at this point, but I could see where that would be useful if there is a global data model of the data stores to match the XML. Generic SQL to generate schemas could be a component of that effort. Relationally, we define a unique case/tumor key on facility_id, Accession_nbr and sequence_nbr. Identification of the patient is a different story, and I am sure there is a healthy thread about it out there somewhere…
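
      As a rough illustration of letting the database catch duplicates on that composite key, here is a Python sketch using the standard-library sqlite3 module. SQLite is only a stand-in, since the thread does not say which database is in use; the table layout, the primary_site column, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical rows as they might come out of the flat file:
# (facility_id, accession_nbr, sequence_nbr, primary_site)
SAMPLE_ROWS = [
    ("0001", "201800001", "00", "C509"),
    ("0001", "201800001", "01", "C341"),
    ("0001", "201800001", "00", "C509"),  # same key as the first row
]


def create_schema(conn):
    """Create a tumor table keyed on the composite key from the thread."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS tumor (
            facility_id   TEXT NOT NULL,
            accession_nbr TEXT NOT NULL,
            sequence_nbr  TEXT NOT NULL,
            primary_site  TEXT,
            PRIMARY KEY (facility_id, accession_nbr, sequence_nbr)
        )
        """
    )


def row_count(conn):
    return conn.execute("SELECT COUNT(*) FROM tumor").fetchone()[0]


def bulk_load(conn, rows):
    """Insert rows, skipping key collisions; return (inserted, skipped)."""
    before = row_count(conn)
    conn.executemany("INSERT OR IGNORE INTO tumor VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    inserted = row_count(conn) - before
    return inserted, len(rows) - inserted


connection = sqlite3.connect(":memory:")
create_schema(connection)
INSERTED, SKIPPED = bulk_load(connection, SAMPLE_ROWS)
```

      INSERT OR IGNORE is SQLite-specific; other databases have their own ways to skip or report key collisions during a bulk load.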

      A related side project will be to use the EDIT50.DLL from the CDC to apply edit checks/scoring for each tumor record. This will be incorporated into the NAACCR XML file parser I am building (looking for a name for this beast). This process would also create KB delimited file(s) for use in bulk loading.

      Thank you for all your feedback

      J

      #7479
      Jeff Reed
      Spectator

      Just noticed that the test files from the GitHub sample file generator did not include the naaccrNum attribute on the <Item> tag, whereas the NAACCR sample file did. Is the naaccrNum attribute going to be required, or do I have to use the full name in the naaccrId attribute to map fields?

      <Item naaccrId="patientIdNumber" naaccrNum="20">01200001</Item>
      vs
      <Item naaccrId="patientIdNumber">00000001</Item>

      #7481
      Fabian Depry
      Moderator

      The standard itself doesn’t require the “naaccrNum” attribute because technically the “naaccrId” is all that is needed to uniquely identify an item. But a lot of software still deals with the numbers, and so it’s convenient to have them.

      I will update the data generator to allow an option to add the numbers to the created file (https://github.com/imsweb/data-generator/issues/30).

      Ultimately, it’s up to you if you want your new framework to require the numbers on incoming data (which is probably more convenient for you), or if you want to follow the strict standard and deal with not always having those numbers.
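
      One way to support both choices is sketched below in Python: take the number from the naaccrNum attribute when it is present, and otherwise either fall back to an ID-to-number lookup or reject the item. The lookup table and the function name are hypothetical; the two sample items are the ones quoted earlier in the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup; in practice it would be built from a NAACCR dictionary.
ID_TO_NUM = {"patientIdNumber": 20, "dateOfDiagnosis": 390}


def item_number(item, id_to_num, require_num=False):
    """Return the NAACCR number for an <Item> element.

    Uses the optional naaccrNum attribute when present. Otherwise it either
    raises (strict local rule) or falls back to the naaccrId lookup, which
    returns None for IDs the lookup does not know.
    """
    raw = item.get("naaccrNum")
    if raw is not None:
        return int(raw)
    if require_num:
        raise ValueError("naaccrNum missing for " + str(item.get("naaccrId")))
    return id_to_num.get(item.get("naaccrId"))


with_num = ET.fromstring('<Item naaccrId="patientIdNumber" naaccrNum="20">01200001</Item>')
without_num = ET.fromstring('<Item naaccrId="patientIdNumber">00000001</Item>')

NUM_FROM_ATTRIBUTE = item_number(with_num, ID_TO_NUM)
NUM_FROM_LOOKUP = item_number(without_num, ID_TO_NUM)
```

      With require_num=True the second item would raise instead, which corresponds to requiring the numbers on incoming data.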

      #7482
      Jeff Reed
      Spectator

      Thank you, Fabian. I guessed wrong and will switch to the required field <Item naaccrId=>. I was hoping naaccrNum would be required so I didn’t have to worry about matching the full name.

      Is the best source to use for mapping the version_id to the valid xml name the “naaccr-dictionary-180.xml” file?

      #7484
      Fabian Depry
      Moderator

      Yup, the “naaccr-dictionary-180.xml” is the best source for mapping NAACCR numbers to NAACCR IDs. At least for standard items. For non-standard items, that mapping should be provided in a “user-defined dictionary” by whoever created the XML data files (the data-generator doesn’t support user-defined dictionaries; I wanted to keep things very simple for now).
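
      A short Python sketch of building that number-to-ID mapping. It assumes the dictionary file lists its items as ItemDef elements carrying naaccrId and naaccrNum attributes in the http://naaccr.org/naaccrxml namespace; the excerpt below is hand-written for illustration and is not copied from the real file, so check the attribute names against your copy of the dictionary.

```python
import xml.etree.ElementTree as ET

ITEM_DEF_TAG = "{http://naaccr.org/naaccrxml}ItemDef"

# Hand-written excerpt in the assumed dictionary layout (illustration only).
DICTIONARY_EXCERPT = """<NaaccrDictionary xmlns="http://naaccr.org/naaccrxml">
  <ItemDefs>
    <ItemDef naaccrId="patientIdNumber" naaccrNum="20"/>
    <ItemDef naaccrId="dateOfDiagnosis" naaccrNum="390"/>
  </ItemDefs>
</NaaccrDictionary>"""


def load_num_to_id(xml_text):
    """Return {naaccrNum: naaccrId} for every ItemDef in the dictionary text."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for item_def in root.iter(ITEM_DEF_TAG):
        mapping[int(item_def.get("naaccrNum"))] = item_def.get("naaccrId")
    return mapping


NUM_TO_ID = load_num_to_id(DICTIONARY_EXCERPT)
# Inverse lookup, for data files that carry only the naaccrId.
ID_TO_NUM = {naaccr_id: num for num, naaccr_id in NUM_TO_ID.items()}
```

      For a real dictionary on disk, ET.parse(path).getroot() would replace ET.fromstring, and a user-defined dictionary in the same layout could be merged into the same mapping.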

      #7563
      Jeff Reed
      Spectator

      Can’t seem to get the XMLPLUS DLL function XMLPlus_GetItemDataByNaaccrId(const int XmlId, const char* naaccrId) working.

      I was able to get the function XMLPlus_GetItemDataByNaaccrNum(const int XmlId, const int naaccrNum) working.

      The …ByNaaccrId function doesn’t seem to like the pointer I try to set. The function I got working does not use a pointer and just passes the numeric ID value. Is it reasonable to require the NAACCR Number in the data? That would be my preferred solution, rather than using the short-name ID.

      #7580

      Hi Jeff,

      Can’t seem to get the XMLPLUS DLL function XMLPlus_GetItemDataByNaaccrId(const int XmlId, const char* naaccrId) working.

      First, what are the return values of your calls into these functions? If any is anything except the all-is-well 0 (zero), you should use the functions in the chapter Handling System Errors to get better information.

      Let’s start with that, and if needed we can dig into your code. Please note, however, that my language is C++. If we end up in some technical area of C#, we’re going to need to enlist the help of somebody with expertise in that area.

      Is it reasonable to require the NAACCR Number in the data?

      FWIW, XMLPlus.dll always writes the data including both the naaccrId and the naaccrNum, because it makes editing the data more efficient. As to being able to require the naaccrNum, that probably depends upon how much clout you have with your reporting entities. Maybe you can make it a rule for your region?

      Kathleen

      #9636

      Perhaps the byNaaccrId function will not work in C#?

      Oh, it will work all right. I just have to enlist the help of a C# programmer at the CDC to look at it. Dollars to donuts it will be something about the declarations.

      The help file says the C# declaration for this function should look something like this:

      [DllImportAttribute("XMLPlus.dll", EntryPoint = "XMLPlus_GetItemDataByNaaccrId")]
      public static extern int XMLPlus_GetItemDataByNaaccrId(int XmlId,
         [InAttribute()] [MarshalAsAttribute(UnmanagedType.LPStr)] string naaccrId);

      (I just typed that from the help file, so please look to it there.) My hunch is that your declaration is not properly converting a C# string to a pointer to array of char, which of course is what you need for a C interface across languages. In the meantime, I’ll see if I can get somebody else to write a test for these functions at the CDC.

      Kathleen

      #9657

      Jeff,

      Would you mind running a test for me? You said you were able to get your code to retrieve data values using the “byNaaccrNum” function. Could you run the loop so that you first call “byNaaccrNum” and then call “byNaaccrId”? I’m trying to narrow the problem down to support my hunch about the declarations, and need to know that all of the other functions are working as advertised.

      So the pseudo-code would look like this (note that all of these calls should capture and test the return value… this is just pseudo-code for program flow):

      XMLPlus_Initialize(...)
      XMLPlus_OpenXmlDataFile(...)
      
      /* while not end-of-file */
      {
          XMLPlus_ReadNextPatient(...)
      
          XMLPlus_GetPatientTumorsCount(...)
      
          for (<iterate over tumors for current patient>)
          {
              XMLPlus_ReadTumor(...)
      
              for (<iterate list of data items you want>)
              {
                   XMLPlus_GetItemDataByNaaccrNum(...)
                   XMLPlus_GetItemDataByNaaccrId(...)
              }
          }
      }
      
      XMLPlus_CloseXmlDataFile(...)
      XMLPlus_Exit(...)

      You might even hard-code some items for this test, e.g., naaccrNum=390 and naaccrId=”dateOfDiagnosis”. Let me know what happens.

      Thanks,
      Kathleen

      #9658
      Jeff Reed
      Spectator

      Thank you Kathleen,

      Your pseudo-code is what I have in development, so no problem testing. It looks like my problem is using a pointer return variable to access the actual string. The bynumber() function I used successfully returns the value, so I have no problem there. The byNaaccrId() function returns a pointer, and apparently I am rusty on using pointers to access the string value, because I can’t seem to get it working and I am not finding a good example.

      [DllImportAttribute("XMLPlus.dll", EntryPoint = "XMLPlus_GetItemDefByNumber")]
      public static extern int XMLPlus_GetItemDefByNumber(int XmlId,
         int naaccrNum, System.IntPtr owner, System.IntPtr callback_func);

      [DllImportAttribute("XMLPlus.dll", EntryPoint = "XMLPlus_GetItemDefByNaaccrId")]
      public static extern int XMLPlus_GetItemDefByNaaccrId(int XmlId,
         [InAttribute()] [MarshalAsAttribute(UnmanagedType.LPStr)] string naaccrId,
         System.IntPtr owner, System.IntPtr callback_func);

      #9661

      Jeff, I’m getting confused.

      I thought you were working with the functions to read the data file, but the declarations you just quoted are for reading the dictionary.

      When reading the data file, you pass the callback parameters (owner, callback_func) to the XMLPlus_OpenXmlDataFile function. My thinking was that you’ll be calling the “XMLPlus_GetItemDataBy[NaaccrId or NaaccrNum]” function gazillions of times, so just post the callback once for the entire run. (There is some set-up involved with posting a callback in the DLL, not costly, but why waste cycles? Reasonable people can argue about my design choices…)

      Looks like I have the problem using a pointer return variable to access the actual string.

      The “GetItemDataBy” functions don’t retrieve the data value directly; instead, the DLL locates the value by the identifier, and then sends the “answer” back through the callback function. I have a very good reason for this design choice: If you expected the DLL to populate a reference variable supplied in your direct call, you would have to first ask how long that string value is so that you could size your string appropriately (otherwise, danger of buffer overflow!). But through the magic of pointers, the DLL is maintaining the contents of the requested item in its local memory long enough to stream it to your callback function, where you can capture it to a C# string and manage that memory yourself.
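
      The flow described above can be mimicked in a few lines of Python. This is only a stand-in to show the pattern: the callback is registered once, each lookup returns a status code, and the value itself arrives through the callback. The class, its method names, the callback signature, and the non-zero status code are all hypothetical and are not XMLPlus.dll's real interface; only "0 means all is well" comes from the thread.

```python
OK = 0         # the thread states that 0 is the all-is-well return value
NOT_FOUND = 1  # hypothetical non-zero status, for illustration only


class FakeXmlReader:
    """Stand-in for a library that owns the item values of the current tumor."""

    def __init__(self, callback):
        # The callback is posted once, when the "file" is opened.
        self._callback = callback
        self._current = {}

    def read_tumor(self, items):
        """Pretend to read the next tumor; the reader keeps the values itself."""
        self._current = dict(items)

    def get_item_data_by_naaccr_id(self, naaccr_id):
        """Locate the value, stream it to the callback, return only a status."""
        value = self._current.get(naaccr_id)
        if value is None:
            return NOT_FOUND
        self._callback(naaccr_id, value)
        return OK


captured = []


def on_item(naaccr_id, value):
    # The caller copies the value into memory that it manages itself.
    captured.append((naaccr_id, value))


reader = FakeXmlReader(on_item)
reader.read_tumor({"dateOfDiagnosis": "20180115", "primarySite": "C509"})
status = reader.get_item_data_by_naaccr_id("dateOfDiagnosis")
missing = reader.get_item_data_by_naaccr_id("noSuchItem")
```

      Because the value is handed over inside the callback, the caller never has to size a buffer in advance, which is the point of the design.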

      Take another look at the C++ sample, “Run EDITS on NAACCR XML Data File”, and look for XmlToFlatReadItemByNaaccrNum and XmlToFlatReadItemByNaaccrId. Particularly for preparing the EDITS buffer dynamically, you can see that more steps are required when you have to do the look-up by naaccrId, but it can be done.

      Kathleen

    Copyright © 2018 NAACCR, Inc. All Rights Reserved