Email Fetching Duplicate Messages

  • cardmaverick
    Member
    • Jan 2018
    • 30

    #16
    I've also noticed some messages have 'null' for the message_id column in the database.


    • yuri
      Member
      • Mar 2014
      • 8440

      #17
      Maybe the cron script fails and then restarts, which brings about duplicates. It needs to be investigated. How many personal email accounts do you have in the system?

      We've been using EspoCRM for years and have never encountered email duplicates.
      If you find EspoCRM good, we would greatly appreciate if you could give the project a star on GitHub. We believe our work truly deserves more recognition. Thanks.


      • yuri
        Member
        • Mar 2014
        • 8440

        #18
        Are you using EspoCRM version 5.0.3?

        • yuri
          Member
          • Mar 2014
          • 8440

          #19
          I've added some fixes to the mail importer class that I believe should solve the issue where emails imported by parallel processes can end up duplicated.

          Could you apply the changed file to your instance manually to check whether it helps?

          https://raw.githubusercontent.com/es...l/Importer.php


          • joy11
            joy11 commented
            I added this update and it doesn't seem to help the issue. I have 2 group emails and approximately 30 personal emails in the system. The personal emails have inbox and sent items monitored. We're using Gmail for business.

            I ran it for approximately 3 hours today and noticed in my MySQL process list (live view) there were several (7 or so) open sleeping connections with a time of 100+ listed. I was still getting duplicate and triplicate emails in the system and sending from the system. It appeared to be mainly on personal emails. Gmail was also having some issues this afternoon, but when I reverted to the original Importer.php it seemed to improve.

            I've been having random duplicate emails for a while, but very few and far between. Today, before the change to Importer.php I had 15 duplicates within 1 hour. I changed the Importer.php and was fine for about an hour then had 17 duplicate/triplicates the following hour.

            I was unable to find the spot in the data/config.php to change anything about the cron execution. Maybe I'm looking in the wrong spot?
            Last edited by joy11; 01-25-2018, 09:51 PM.

          • yuri
            yuri commented
            Do these duplicates have an identical message id?

            What max email portion size is set in Administration > Inbound Emails?

            Thanks.

          • joy11
            joy11 commented
            The emails do have identical message_id fields in the database. I have the max email portion size set to 20 on personal and 20 on group. I turned off most of my workflow jobs to try to improve performance. This morning I changed 'minExecutionTime' to 60 (before I saw your post below). It didn't seem to hurt anything during the day, and I had only 1 duplicate in several hours. This evening I upgraded to 5.0.3 and have seen several duplicates since the upgrade. I will wait for the next release now, since my upgrade is clean.
        • cardmaverick
          Member
          • Jan 2018
          • 30

          #20
          Version 5.0.3

          Max email portion size for personal account fetching: 4,000 (I used 10 originally, but I bumped it up when I realized how long import would take with ~10K messages).

          All dependencies met at install, I also added in php mailparse before bringing messages into the system (not mentioned in the installer dependency section if I remember right).

          I was only using one personal email account - I did have my SMTP info entered into every place possible in the CRM though - perhaps that's contributing? I'm monitoring both Inbox and Sent folders. My email is provided by GoDaddy - I use their Workspace Client online right now.

          I do develop - on the surface it feels like a mix of the cron job failing and inadequate duplicate checking in the database / processing script. You could add a hash column to the database, build the hash from the actual message, and put a unique key on that column. The null message_id thing strikes me as a quirk of email standards being all over the place. Every email I get from one company triggers an email format specification warning in the error logs.

          Hope that helps!


          • cardmaverick
            Member
            • Jan 2018
            • 30

            #21
            You process email in parallel? I'm actually very familiar with PHP in parallel (I wrote an entire parallel processor for an internal program) - how are you processing in parallel? Are you dividing the work into piles? My own processor gives each worker its own pile of tasks to avoid race conditions. But when it comes to inserting into MySQL, you can't do parallel inserts to the same database table - they are sequential no matter how many connections you have. If you are opening multiple connections, you're wasting your time - it has no impact on insert performance. A better method is to process the data in parallel outside the database (wherever anything can be done outside the database), then recombine the edited data and do big insert statements of 5,000 or so records.
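
As a rough sketch of the "big insert statements" idea above: the helper below splits rows into batches and builds one multi-row, parameterized INSERT per batch. The function name, table and columns are made up for illustration and are not taken from EspoCRM.

```php
<?php
/**
 * Build one multi-row INSERT statement per batch of rows.
 * Table and column names are hypothetical and must come from trusted code,
 * because they are interpolated directly into the SQL text.
 *
 * @param string   $table     Target table name.
 * @param string[] $columns   Column names, in the order used by each row.
 * @param array[]  $rows      List of rows; each row is a list of values.
 * @param int      $batchSize Maximum number of rows per statement.
 * @return array[] List of ['sql' => string, 'params' => array] pairs.
 */
function buildBatchedInserts(string $table, array $columns, array $rows, int $batchSize = 5000): array
{
    $statements = [];
    // One "(?, ?, ...)" group per row, one "?" per column.
    $rowPlaceholder = '(' . implode(', ', array_fill(0, count($columns), '?')) . ')';
    $columnList = implode(', ', $columns);

    foreach (array_chunk($rows, $batchSize) as $batch) {
        $sql = "INSERT INTO {$table} ({$columnList}) VALUES "
            . implode(', ', array_fill(0, count($batch), $rowPlaceholder));
        // Flatten the batch so the values line up with the placeholders.
        $params = array_merge(...array_map('array_values', $batch));
        $statements[] = ['sql' => $sql, 'params' => $params];
    }

    return $statements;
}
```

Each returned pair could then be passed to a prepared statement, e.g. `$pdo->prepare($s['sql'])->execute($s['params'])`. With many columns, a smaller batch size may be needed to stay within the server's limits on statement size and placeholder count.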


            • cardmaverick
              Member
              • Jan 2018
              • 30

              #22
              Your database schema has no unique key on the 'message_id' column - that might be the issue here - assuming all emails have truly unique message IDs.


              • yuri
                Member
                • Mar 2014
                • 8440

                #23
                I meant multiple cron runs in parallel: one script is running, and another starts before the previous one has finished. If you have only one personal account, that should not happen unless a cron run takes more than 2 hours. That could have happened here; in that case the system treats the job as failed and starts it again (up to 3 times before terminating). These params are configurable in data/config.php

                I think you need:
                1. Drop max portion size to ~100.
                2. Apply the changed file I linked above.

                • yuri
                  Member
                  • Mar 2014
                  • 8440

                  #24
                  Note that you may still have a job running right now, and it can take some time to finish.

                  Cron config params:
                  PHP Code:
                   
                  'cron' => array(
                      /** Max number of jobs per one execution. */
                      'maxJobNumber' => 15,

                      /** Max execution time (in seconds) allocated for a single job. If exceeded, the job is set to Failed. */
                      'jobPeriod' => 7800,

                      /** Attempts to re-run failed jobs. */
                      'attempts' => 2
                  )
                  


                  • joy11
                    joy11 commented
                    Great thank you. My files have this code in this 'cron' section also:

                    /** Min time (in seconds) between two cron runs. */
                    'minExecutionTime' => 50,

                    Do you know if this is the time between two of the same job attempts? If it is, I'll bump this execution time up and that should help.

                  • yuri
                    yuri commented
                    minExecutionTime shouldn't be touched. You can bump up jobPeriod. Also set attempts = 1
                    Last edited by yuri; 01-26-2018, 07:15 PM.

                  • yuri
                    yuri commented
                    I'd recommend waiting for the next release. We are going to improve the scheduled jobs implementation.
                • yuri
                  Member
                  • Mar 2014
                  • 8440

                  #25
                  We can't afford to apply a unique key for now, for the following reasons:
                  1. Users may already have nulls and duplicates in their database.
                  2. The upgrade would take far too long, since email tables are usually big.

                  You can apply it manually to your own database. The exception will be caught and no fatal error should occur.
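
To illustrate what a manually applied unique key does, here is a minimal sketch. It uses an in-memory SQLite database through PDO (requires the pdo_sqlite extension) as a stand-in for MySQL, with a simplified, hypothetical `email` table. The `insertEmail` helper is invented for this example and is not EspoCRM's importer code.

```php
<?php
// In-memory SQLite stands in for the real MySQL database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Hypothetical, simplified table; the real email table has many more columns.
// On MySQL the key would be added with something like:
//   ALTER TABLE email ADD UNIQUE KEY uniq_message_id (message_id);
// Existing duplicates have to be cleaned up first, or the ALTER fails.
$pdo->exec('CREATE TABLE email (id INTEGER PRIMARY KEY, message_id VARCHAR(255), name TEXT)');
$pdo->exec('CREATE UNIQUE INDEX uniq_message_id ON email (message_id)');

/** Insert an email; return false instead of failing when message_id already exists. */
function insertEmail(PDO $pdo, ?string $messageId, string $name): bool
{
    $stmt = $pdo->prepare('INSERT INTO email (message_id, name) VALUES (?, ?)');
    try {
        $stmt->execute([$messageId, $name]);
        return true;
    } catch (PDOException $e) {
        // SQLSTATE class 23 means an integrity constraint violation,
        // which is what a duplicate key produces.
        if (strpos((string) $e->getCode(), '23') === 0) {
            return false;
        }
        throw $e;
    }
}
```

Note that a unique index still accepts any number of rows with a NULL message_id (in both SQLite and MySQL), so it does not help with emails that arrive without a Message-ID header.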

                  • yuri
                    Member
                    • Mar 2014
                    • 8440

                    #26
                    Ok. We are now working on this issue. There are two points:

                    1. Make it possible for jobs to run for a long period, i.e. longer than the current 2-hour limit. We will need to track the PID. At least we can pull it off for Unix-based systems.
                    2. Prevent email duplicates that can occur in extreme conditions, e.g. when multiple processes are executed simultaneously and the server runs out of resources.
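
One common way to keep two runs from overlapping is an exclusive file lock. This is a different technique from the PID tracking mentioned above and is shown only as a sketch; `runExclusively` and the lock file path are made-up names, not part of EspoCRM.

```php
<?php
/**
 * Run $job only if no other run currently holds the lock file.
 * Returns true when the job ran, false when another run was already active.
 */
function runExclusively(string $lockFile, callable $job): bool
{
    // Mode 'c' creates the file if needed without truncating it.
    $handle = fopen($lockFile, 'c');
    if ($handle === false) {
        throw new RuntimeException("Cannot open lock file: {$lockFile}");
    }

    // LOCK_NB makes flock() return immediately instead of waiting.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return false;
    }

    try {
        $job();
    } finally {
        // The OS also drops the lock if the process dies mid-job.
        flock($handle, LOCK_UN);
        fclose($handle);
    }

    return true;
}
```

A cron entry point could wrap the fetch in `runExclusively(sys_get_temp_dir() . '/email-fetch.lock', $fetchJob)` and simply exit when it returns false. Lock behavior depends on the platform and filesystem, so this would need checking on the actual server.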

                    • cardmaverick
                      Member
                      • Jan 2018
                      • 30

                      #27
                      I was looking at your importer code - are you looking for duplicates based strictly on message ID?

                      protected function findDuplicate(Entity $email)
                      {
                          if ($email->get('messageId')) {
                              $duplicate = $this->getEntityManager()->getRepository('Email')->where(array(
                                  'messageId' => $email->get('messageId')
                              ))->findOne();
                              if ($duplicate) {
                                  return $duplicate;
                              }
                          }
                      }

                      A system of hashing the actual message content into a 255 character key might work better. I saw a ton of NULL message IDs before disallowing null IDs in the database message id column - unless there is an issue deeper in your program that is erasing them.
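
A minimal sketch of such a hash key, assuming the sender, date, subject and body are available as strings. The function name and the choice of fields are assumptions for illustration; this is not how EspoCRM detects duplicates.

```php
<?php
/**
 * Build a fallback de-duplication key for an email.
 * Uses the Message-ID when present; otherwise hashes fields that are
 * likely to identify the message. The result is well under 255 characters.
 */
function emailFingerprint(?string $messageId, string $from, string $date, string $subject, string $body): string
{
    if ($messageId !== null && trim($messageId) !== '') {
        return 'mid:' . hash('sha256', trim($messageId));
    }

    // Normalize line endings and surrounding whitespace so trivial
    // transport differences do not change the hash.
    $normalize = static function (string $value): string {
        return trim(str_replace(["\r\n", "\r"], "\n", $value));
    };
    $parts = array_map($normalize, [$from, $date, $subject, $body]);

    // Join with a control character as separator so that field
    // boundaries stay unambiguous in ordinary text.
    return 'body:' . hash('sha256', implode("\x1F", $parts));
}
```

The key could be stored in its own column and looked up before inserting. Two genuinely different emails with identical sender, date, subject and body would collide, so the field list is a trade-off.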


                      • cardmaverick
                        Member
                        • Jan 2018
                        • 30

                        #28

                        So I just got off the phone with GoDaddy - I'm not sure if this is the issue or not, but my email account is very old. It used to be on a POP3 server, but they migrated everyone to an IMAP server, so a lot of my mailbox was at one point handled by POP3. I don't know if this affects email formatting, but I figure it's worth mentioning. I'm looking at having to just cut ties with the box and move to something more modern and *clean slate*.


                        • cardmaverick
                          Member
                          • Jan 2018
                          • 30

                          #29
                          I just dug in deeper here, looking at message headers in my workspace email client to see if messages with null message_id fields were missing the 'Message-Id' header. They are.

                          I also updated the Importer.php file and reverted to the old database index / column settings for the message_id column - tons of duplicates entering the system.


                          • yuri
                            Member
                            • Mar 2014
                            • 8440

                            #30
                            Hi,

                            Have you decreased the max portion size? The aim is for each portion to be fetched in less than 2 hours.

                            Also you can bump up jobPeriod.